
The Recovery Script Meets Production


The Setup

Eight days ago, the Gentoo system crashed and lost authentication. March 9, I finally have the recovery script ready. It's 28KB. It has eight phases. It took three sessions to build because the first two ran out of context.

But today is the day we deploy it.

Also, I need to refactor two ArgoBox modules. Also, the RAG embedding pipeline needs restarting because of a system reboot. Also, the Colorado legal RAG is autonomously building an index on Titan CT 103.

It's one of those days where the plates really are spinning and you have to keep every one of them moving.

The Recovery Script Anatomy

Eight phases:

  1. Diagnostics — capture system state before changing anything. User info, environment, installed packages, PAM config. Write a full baseline so if things get worse, I have something to compare against.

  2. Process cleanup — the March 2 crash left D-state processes stuck in kernel-land. Hunt for them. Kill them if possible.

  3. libxcrypt rebuild — the core fix. Recompile libxcrypt with SHA512-CRYPT support. Takes about an hour. If it times out, fall back to manual configure/make/install.

  4. Password reset — set a new password non-interactively. Don't use the passwd command (it's interactive and might hang on a broken TTY). Write the hash directly to /etc/shadow using OpenSSL.

  5. PAM verification — test that PAM can actually authenticate with the new password. Use su - argo as a canary.

  6. Display manager — re-enable SDDM in the default runlevel so the system boots to the GUI again.

  7. System state snapshot — capture process list, memory, disk usage, dmesg. Compare with phase 1 diagnostics. Did the recovery succeed?

  8. Summary — human-readable report of what was done and what failed.
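The eight phases above suggest a runner that logs each phase, enforces a timeout, and keeps going on failure. A minimal sketch of what that skeleton might look like — the function names, timeout values, and log layout here are my assumptions, not the script's actual code:

```shell
#!/usr/bin/env bash
set -u

# Hypothetical log directory; the real script timestamps it
LOG_DIR="${LOG_DIR:-/tmp/gentoo-recovery-demo-$$}"
mkdir -p "$LOG_DIR"

log() { printf '%s %s\n' "$(date -u +%FT%TZ)" "$*" >> "$LOG_DIR/main.log"; }

# Run one phase under a timeout. A failed or hung phase is recorded,
# then we move on to the next phase -- never abort the whole run.
run_phase() {
    local name="$1" limit="$2"; shift 2
    log "PHASE $name: starting (timeout ${limit}s)"
    if timeout --kill-after=10 "$limit" "$@" >> "$LOG_DIR/main.log" 2>&1; then
        log "PHASE $name: ok"
    else
        log "PHASE $name: FAILED or timed out (exit $?)"
    fi
}

# Stand-ins for the real phases (diagnostics through summary)
run_phase diagnostics 300 uname -a
run_phase summary     300 echo "recovery run complete"
```

The key property is in `run_phase`: GNU `timeout` kills anything that hangs, and the `if/else` converts failure into a log line instead of a script exit.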

The Logging System

Five separate log files:

/tmp/gentoo-recovery-logs-20260309-143022/
├── main.log          # Full execution timeline
├── diagnostics.log   # System state (before)
├── system-state.log  # System state (after)
├── error-bundle.txt  # Summary of failures
└── pivots.log        # "We tried X, it failed, trying Y instead"

pivots.log is the MVP. When a step fails, the script doesn't just log "FAILED." It logs what failed, why, and what it's trying as a fallback. "libxcrypt emerge timed out after 3600s, falling back to manual configure/make/install." That kind of information is gold if the script hangs and I have to investigate why.
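A pivot logger in that spirit is a few lines of shell. This is a guess at the shape, not the script's actual helper — the function name and message format are mine:

```shell
#!/usr/bin/env bash
LOG_DIR="${LOG_DIR:-/tmp/pivot-demo-$$}"
mkdir -p "$LOG_DIR"

# Record what failed, why, and what we're falling back to -- one line per pivot
log_pivot() {
    printf '%s PIVOT: %s failed (%s); trying %s instead\n' \
        "$(date -u +%FT%TZ)" "$1" "$2" "$3" >> "$LOG_DIR/pivots.log"
}

log_pivot "libxcrypt emerge" "timed out after 3600s" "manual configure/make/install"
```

Three arguments instead of one forces the caller to articulate the failure, the reason, and the fallback every time — which is exactly the information a post-mortem needs.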

The Timeouts

Every operation has a timeout. Default 5 minutes. Emerge operations get 1 hour. If something exceeds its timeout, the script kills the process and moves to the next phase.

This is critical because the March 2 crash was caused by processes hanging indefinitely. A recovery script that also hangs indefinitely is worse than no script at all.

Non-interactive everything. No "press Enter to continue." No interactive password prompts. The TTY after a crash can be unreliable — stdin might not respond. Passwords are set via direct shadow file editing. Configuration changes use sed and echo, not interactive editors.
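The non-interactive password reset (phase 4) can be sketched against a copy of the shadow file — the paths, password, and account line here are illustrative, and `openssl passwd -6` assumes OpenSSL 1.1.1 or newer:

```shell
#!/usr/bin/env bash
set -u

# Work on a throwaway copy, never the real /etc/shadow in a demo
SHADOW="/tmp/shadow-demo-$$"
printf 'argo:!:19800:0:99999:7:::\n' > "$SHADOW"   # locked account, as after the crash

# Generate a SHA512-crypt hash without ever touching a TTY
HASH=$(openssl passwd -6 'correct-horse-battery')

# Swap the hash field in place. Crypt hashes use only [A-Za-z0-9./$],
# none of which is special in a sed replacement with `|` as delimiter.
sed -i "s|^argo:[^:]*:|argo:${HASH}:|" "$SHADOW"

# Phase 5 would then verify with a canary login: su - argo -c true
grep '^argo:\$6\$' "$SHADOW"
```

No `passwd` prompt, no stdin dependency — the whole exchange is file edits and pipes, which is the point on a possibly-broken TTY.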

The Deployment Day

March 9, the script is ready. The system is ready. Time to boot Gentoo and run it.

(The actual execution will be March 10. But the preparation, the testing, the "is everything in place?" — that's today.)

I also have to:

  1. Restart RAG embeddings — the system reboot killed the background jobs. The embeddings for the qwen06b and qwen8b models are only ~20% complete. I need to restart them in parallel and let them run overnight.

  2. Colorado legal RAG — running autonomously on Titan CT 103. 614K chunks of case law indexed in 6+ hours. Still indexing federal statutes. Check the logs, make sure it's progressing, no manual intervention needed.

  3. ArgoBox modules — two extractions to finish:

    • Deployments module — refactored to 3-tier architecture (core logic + adapters + thin API routes). 40% code reduction. 8 files modified, 5 new files created.
    • Argonaut module — the AI agent admin extracted into the same pattern. 27 files. Tier 1 core, Tier 2 adapters, thin Astro pages. API routes down 34% in lines.

All three of these things need to happen.
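The embedding restart in item 1 is a fire-and-forget pattern: launch one background job per model, capture the PIDs, optionally wait. A sketch of that pattern — `restart_embeddings` and the real launch command are placeholders, so a short sleep stands in for the actual job:

```shell
#!/usr/bin/env bash

restart_embeddings() {
    # $@: model names. The real launch would look something like
    #   nohup python embed_pipeline.py --model "$model" >"/tmp/embed-$model.log" 2>&1 &
    # (embed_pipeline.py is a hypothetical name, not the real script)
    local model
    for model in "$@"; do
        sleep 0.1 &                       # stand-in for the long-running job
        echo "restarted $model (pid $!)"
    done
    wait   # drop this to let the jobs run overnight instead of blocking
}

restart_embeddings qwen06b qwen8b
```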

The Deployments Module

This one's mechanical but important. The old code had the same logic repeated across 4 API routes. Sync with Gitea, sync with Cloudflare, handle errors, return response. Same pattern, same error handling, same pitfalls everywhere.

Created:

  • lib/deployments-sync/gitea.ts — core Gitea API (clone repos, list branches, etc.)
  • lib/deployments-sync/cloudflare.ts — core CF API (list deployments, get build logs)
  • lib/deployments-sync/adapters/argobox.ts — auth adapter (credential lookup, token refresh)
  • lib/deployments-sync/types.ts — shared interfaces

Now the 4 API routes just call these functions. Less code. No duplication. Testable.

Refactored the endpoints, committed the changes, pushed to Gitea. The CF Pages build got triggered immediately. Should be live by tomorrow.

The Argonaut Module

Bigger scope. The Argonaut AI agent has:

  • Chat API (talk to the agent)
  • RAG search (query vector/text databases)
  • Task management (create, filter, sort tasks)
  • Voice profiles (manage AI voice config)
  • Admin audit (audit trail of agent actions)

Extracted all of it following the pattern from api-credentials and deployments:

  • Tier 1: core logic with zero framework imports
  • Tier 2: adapters for ArgoBox vs standalone
  • Thin API routes that delegate to Tier 1

The extraction uncovered two missing API routes — models.ts and voice-score.ts — that were referenced in the config but didn't exist as separate files. Now they do. Now they serve static data from Tier 1.

Line reductions:

  • chat.ts: 319 → 49 lines (85% reduction)
  • status.ts: 85 → 32 lines (62%)
  • tasks.ts: 59 → 30 lines (49%)

Same functionality, way less surface area. Easier to test. Easier to port to other frameworks.

The Confluence

This is what a high-velocity day looks like: three parallel workstreams, all moving simultaneously.

  • Recovery infrastructure (the script) — months of thought, weeks of implementation, ready to deploy.
  • Long-running data processing (RAG embeddings, Colorado legal indexing) — invisible, autonomous, just needs monitoring.
  • Module refactoring (deployments, argonaut) — the kind of work that feels small until you ship it and realize how much cleaner the codebase is.

All of this happens in one day because the systems are designed to run in parallel. I don't wait for the embeddings to finish before extracting modules. I don't wait for the deployments refactor to finish before working on argonaut. I restart the embeddings, check the Colorado RAG logs, and write code while the GPU crunches vectors and the Titan server runs Python indexing jobs.

What's Next

March 10: boot Gentoo, run the recovery script, fix libxcrypt, reset password, re-enable display manager. If everything works, the system is back online after eight days of downtime.

Also March 10 (probably): verify that the CF Pages build succeeded. Check that the deployments and argonaut modules are working correctly in the live site.

Also March 10 (probably): check embeddings progress. If they've completed, verify database integrity and start using them in search.

Also March 10 (probably): check Colorado legal RAG progress. If indexing is done, build the query API and test it.

It's the kind of day where you finish one thing and five other things are waiting. But that's what operational velocity looks like when you've got infrastructure that can run itself.

Tomorrow, everything gets fixed at once.

Or it doesn't, and I have logs to figure out why.

Either way, I'll know.