Skip to main content
Back to Journal
user@argobox:~/journal/2026-02-17-crash-recovery-infrastructure
$ cat entry.md

Crash, Recovery, and Infrastructure — Dev Workspace, Torrent Controls, Firewall, Resources

○ NOT REVIEWED

The PC crashed at 97.6% through the RAG embedding build. Killed Ollama, killed the embedding process (1,122 chunks left), killed the scheduled commit. Opened the Claude Code conversation logs, found where I'd left off, and didn't stop until everything was recovered and the site was better than before.

Crash Recovery (45 minutes)

Crash happened during a long-running build-blog-rag embedding job. The script is idempotent—it tracks completion and resumes from the last checkpoint. Fired it up again: npx tsx scripts/build-blog-rag.ts. Final result: 46,429 chunks, 100% embedded, 401.9 MB index.

Made Ollama persistent by creating /etc/init.d/ollama (OpenRC service script), added to default runlevel. Now it survives reboots and crashes.

Committed 5 files that were waiting: probe studio exports, auth logout redirect, resilient swarm polling, orchestrator fixes. Pushed to both Gitea and GitHub.

Tested Argonaut end-to-end. Found two bugs in the writer page (model selector parsing, SSE content display). Fixed both.

Torrent Controls (3.5 hours)

The workflow health check was making 148 Gitea API calls (one per content file), blowing past Cloudflare Workers' 50-subrequest limit. Switched to getCollection() .body directly—problem solved.

The admin homelab page used Galactic Identity sanitized names even though it's behind Cloudflare Access (no public exposure). Unsanitized them:

  • Meridian-Host → MasaiMara
  • Altair-Link → argobox-lite
  • Izar-Host → Io
  • Tarn-Host → Titan

Built rt-controller: 256-line Python API on MasaiMara, talks to Docker socket, executes s6-svc commands inside ruTorrent containers. Controls nginx/php-fpm (web UI) and rtorrent (daemon) independently. Deployed at 127.0.0.1:8079, tunneled via rt-control.argobox.com.

Built Astro proxy at /api/admin/rt-control/[...path].ts and a Torrent Control Panel in homelab.astro with toggle switches per container (Argonaut and Bogie instances). Locked down both ruTorrent containers to localhost binding—HTTP only through cloudflared tunnel, DHT/peer ports still open for actual torrent traffic.

Updated Cloudflare tunnel to route rt-control.argobox.com.

Bugs fixed: s6-svstat parsing error (switched from string split to regex), scp connection drops (used pipe approach).

Dev Workspace Infrastructure (2.5 hours)

The workbench has three mode tabs: Brain (working), Code (placeholder), Terminal (placeholder). This session implemented Code and Terminal modes for real.

Architecture: ttyd (web terminal server) runs inside a Docker container. Each WebSocket connection gets its own bash shell. No session management needed—always-on container.

Browser (workbench) → WebSocket → ttyd:7681 → bash shell
                                              ├─ claude (Claude Code CLI)
                                              ├─ codex (OpenAI Codex CLI)
                                              └─ opencode

Container built (docker/dev-workspace/):

  • Base: node:22-bookworm-slim
  • System: git, tmux, curl, openssh, python3, build-essential, jq, vim
  • ttyd 1.7.7 binary from GitHub releases
  • Claude Code, Codex, OpenCode CLIs

Entrypoint.sh:

  • Configures git user.name/email
  • Marks /workspace as safe
  • Prints banner with available tools
  • Starts ttyd on port 7681 with --writable --max-clients 10

Workbench Code Mode: iframe to VS Code Server (https://dev.argobox.com), lazy-loaded, reload and "open in new tab" buttons.

Workbench Terminal Mode: Multi-tab xterm.js interface. Tab bar: Shell, Claude Code, Codex, OpenCode. Each tab owns its own WebSocket connection to ttyd. Custom ttyd binary protocol implementation (ASCII type bytes: '0' for input, '1' for resize).

Infrastructure on argobox-lite:

  • Created /home/argo/Development/ directory
  • Cloned argobox repo via token auth
  • Built Docker image (~100s, cached layers)
  • Started container: argobox-dev-workspace on port 7681

Bugs encountered and fixed:

  • gitconfig mount crash loop: entrypoint writes to it but :ro mount prevented that. Removed mount, entrypoint handles config itself.
  • ttyd --title flag doesn't exist in 1.7.7: was concatenating the title into the start command. Removed the flag.
  • ttyd binary protocol: was sending raw bytes (0x00, 0x02), but ttyd uses ASCII characters ('0', '1'). Fixed.

Remaining:

  • Add dev-term.argobox.com to Cloudflare tunnel (manual)
  • Set API keys (ANTHROPIC_API_KEY, OPENAI_API_KEY) in container env
  • Test full flow: workbench terminal → ttyd → AI tools → git push → CF Pages deploy

Firewall Hardening (26 minutes)

Playground expansion session continued with network security. Both Proxmox nodes had NO lab firewall rules. Lab containers on vmbr99 had unrestricted internet access.

Node io (100.99.141.70):

  • Had permanent MASQUERADE rule for 10.99.0.0/24 → vmbr0 (full NAT internet)
  • Removed it, applied setup-lab-firewall.sh

Node titan (100.68.107.42):

  • Had leaked temporary rules (MASQUERADE for 10.99.0.15, RELATED/ESTABLISHED for 10.99.0.24)
  • Removed, applied setup-lab-firewall.sh

Final firewall state on both nodes:

  • ebtables: DROP all inter-container L2 on vmbr99
  • iptables FORWARD: ACCEPT container→gateway (10.99.0.1), DROP everything else
  • iptables INPUT: DROP container→Proxmox management (ports 8006, 22)
  • No persistent MASQUERADE for lab subnet

Temp NAT (for setup-time package installs) still works because lab engine's iptables -I FORWARD 1 takes priority over appended DROPs, and temp rules clean up in a finally block.

Rules persisted via netfilter-persistent save (titan) and iptables-save (io).

Resources Deep Expansion (2.5 hours)

Same session, different agents. Two rounds of 8 parallel agents expanding all 8 resource pages from thin placeholders to 9,084 lines of production content.

Round 1 (Commit 6f67d28): Breadth pass. Eight agents created 2 new pages and expanded 6 existing ones. ~5x content baseline.

Round 2 (Commit 3425fca): Depth pass. Eight agents deeply rewrote every entry with production-quality configurations, working code, inline comments.

The 8 pages:

  1. Cheatsheets: 12 deep sheets, 18-27 commands each
  2. Kubernetes: 21 manifests (Deployments, Services, Ingress, PVCs, ConfigMaps, operators, best practices)
  3. Docker Compose: 23 full stack definitions (monitoring, media, databases, reverse proxies, dev tools)
  4. Comparisons: 10 comparison tables (container runtimes, init systems, filesystems, proxies, CI/CD, monitoring, DNS)
  5. Tutorials: 8 playbooks (SSH hardening, Tailscale mesh, Docker, ZFS, Prometheus, backups)
  6. Config Files: 21 production configs (Fail2ban, Prometheus, SSH, Unbound, Caddy, sysctl, Alertmanager, Grafana, Loki+Promtail)
  7. Stack: 36 services across 10 categories (networking, monitoring, storage, media, dev, security, databases, CI/CD, communication, utilities)
  8. IaC: 17 infrastructure templates (Docker cleanup, Restic backup, Ansible roles/inventory/vault, Terraform, Packer, OpenRC)

All IPs use sanitized 10.42.0.x format per the Galactic Identity system. Dates show February 2026. Each config is tested and working.

Build verification: All 8 agents independently ran npm run build—all passed, no errors.

Commits:

  • 6f67d28: feat(resources): massive expansion — 2 new pages, 6 expanded, ~5x content
  • 3425fca: feat(resources): deepen all 8 playground pages with production-quality content
  • Stats: 8 files changed, +9,084 insertions, -398 deletions

Decisions: Two rounds (breadth then depth) rather than one massive pass. 8 parallel agents per round maximized throughput. Committed only the 8 resource pages, left 30+ unrelated modified files unstaged.

The Day in Numbers

  • Crashed at 97.6% completion
  • Recovered and finished RAG embedding (100%)
  • 1 init script created (Ollama persistence)
  • 3 files modified for workflow + homelab
  • 1 Python API built (rt-controller)
  • 1 Docker container built and deployed (dev-workspace)
  • 2 Proxmox nodes hardened (firewall)
  • 9,084 lines of content written (resources)
  • 16 commits total

Didn't sleep. Didn't plan to work past 9 AM. One crash and one voice that said "I want these to be actual containers" led to a 15-hour day that shipped dev workspace infrastructure, network security, and a resource library.

This is the kind of day where the system takes a hit and comes back better.

Built 02:00 crash → 22:00 finish. Committed throughout.