The Recovery Script Meets Production

The Setup

Eight days after a Gentoo workstation authentication failure, the recovery workflow is ready. The script is 28KB and has eight phases. It took three sessions to design, review, and harden because the goal was not a one-off fix; the goal was a repeatable operator runbook.

Today is prep day: everything gets staged so the script can run tomorrow.

Also, I need to refactor two ArgoBox modules. Also, the RAG embedding pipeline needs restarting because of a system reboot. Also, the Colorado legal RAG is autonomously building an index on Titan CT 103.

It's one of those days where every plate is spinning at once and all of them have to stay up.

The Recovery Script Anatomy

Eight phases:

1. Diagnostics: capture system state before changing anything. User info, environment, installed packages, PAM config. Write a full baseline so every change has a reference point.
2. Process cleanup: detect D-state processes and clear what can be cleared safely before authentication repair begins.
3. libxcrypt rebuild: the core fix. Recompile libxcrypt with SHA512-CRYPT support. Takes about an hour. If it times out, fall back to manual configure/make/install.
4. Credential reset: reset credentials non-interactively from the recovery environment so the workflow does not depend on a fragile TTY session.
5. PAM verification: test that authentication succeeds after the reset using a controlled canary login.
6. Display manager: re-enable SDDM in the default runlevel so the system boots to the GUI again.
7. System state snapshot: capture process list, memory, disk usage, dmesg. Compare with phase 1 diagnostics. Did the recovery succeed?
8. Summary: human-readable report of what was done and what failed.
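
The phase sequence above can be pictured as a small runner that records each outcome instead of aborting on the first failure. This is a minimal sketch in Python, not the actual script: `PhaseResult`, `run_phases`, and `summarize` are hypothetical names, and the choice to continue after a failed phase is an assumption based on the "move to the next phase" behavior described for timeouts.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PhaseResult:
    name: str
    ok: bool
    detail: str = ""


def run_phases(phases: List[Tuple[str, Callable[[], None]]]) -> List[PhaseResult]:
    """Run each phase in order; a failure is recorded, not fatal (assumption)."""
    results = []
    for name, fn in phases:
        try:
            fn()
            results.append(PhaseResult(name, True))
        except Exception as exc:  # record the failure and keep going
            results.append(PhaseResult(name, False, str(exc)))
    return results


def summarize(results: List[PhaseResult]) -> str:
    """Build a human-readable report of what was done and what failed."""
    lines = []
    for i, r in enumerate(results, start=1):
        status = "OK" if r.ok else f"FAILED ({r.detail})"
        lines.append(f"Phase {i}: {r.name}: {status}")
    return "\n".join(lines)
```

Each phase would be a plain function that raises on failure, so the summary phase only has to format what the runner already collected.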

The Logging System

Five separate log files:

/tmp/gentoo-recovery-logs-20260309-143022/
├── main.log          # Full execution timeline
├── diagnostics.log   # System state (before)
├── system-state.log  # System state (after)
├── error-bundle.txt  # Summary of failures
└── pivots.log        # "We tried X, it failed, trying Y instead"
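
A layout like this is cheap to create up front. The sketch below assumes the directory suffix is a `YYYYMMDD-HHMMSS` timestamp, which is how the example name reads; `log_dir_name` and `create_log_dir` are hypothetical helpers, not functions from the real script.

```python
from datetime import datetime
from pathlib import Path

# File names taken from the layout above.
LOG_FILES = [
    "main.log",
    "diagnostics.log",
    "system-state.log",
    "error-bundle.txt",
    "pivots.log",
]


def log_dir_name(now: datetime) -> str:
    """Directory name with a YYYYMMDD-HHMMSS suffix (assumed format)."""
    return "gentoo-recovery-logs-" + now.strftime("%Y%m%d-%H%M%S")


def create_log_dir(base: Path, now: datetime) -> Path:
    """Create the run directory and touch all five log files."""
    log_dir = base / log_dir_name(now)
    log_dir.mkdir(parents=True, exist_ok=False)  # never reuse an old run's logs
    for name in LOG_FILES:
        (log_dir / name).touch()
    return log_dir
```

Creating every file at the start means a missing or empty log is itself a signal: the phase that should have written it never ran.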

pivots.log is the MVP. When a step fails, the script doesn't just log "FAILED." It logs what failed, why, and what it's trying as a fallback. "libxcrypt emerge timed out after 3600s, falling back to manual configure/make/install." That kind of information is gold if the script hangs and I have to investigate why.
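
A pivot entry boils down to three fields: what failed, why, and what comes next. Here is a minimal sketch of a helper that writes one, with a hypothetical line format and hypothetical function names; the real script's format may differ.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def format_pivot(what: str, why: str, fallback: str, now: datetime) -> str:
    """One pivot line: what failed, why, and the fallback being tried."""
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    return f"[{stamp}] PIVOT: {what} {why}, falling back to {fallback}"


def log_pivot(pivots_log: Path, what: str, why: str, fallback: str,
              now: Optional[datetime] = None) -> str:
    """Append a pivot line to pivots.log and return it."""
    line = format_pivot(what, why, fallback, now or datetime.now())
    with pivots_log.open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

Calling `log_pivot(path, "libxcrypt emerge", "timed out after 3600s", "manual configure/make/install")` would append a line ending in the same message quoted above.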

The Timeouts

Every operation has a timeout. Default 5 minutes. Emerge operations get 1 hour. If something exceeds its timeout, the script kills the process, logs the pivot, and moves on to the fallback or the next phase.

This is critical because recovery tooling should be bounded and observable. A repair workflow that can hang indefinitely is not a workflow; it's another unknown.

Non-interactive everything. No "press Enter to continue." No interactive credential prompts. The TTY after a crash can be unreliable, so the recovery path is designed to run from a bounded, logged recovery environment.
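
Both rules, bounded and non-interactive, fit in one small wrapper. The sketch below uses Python's `subprocess.run`; `run_bounded` and the two constants are hypothetical names, with the timeout values taken from the defaults described above.

```python
import subprocess
from typing import List, Optional, Tuple

DEFAULT_TIMEOUT = 300   # 5 minutes, the default described above
EMERGE_TIMEOUT = 3600   # 1 hour for emerge operations


def run_bounded(cmd: List[str],
                timeout: float = DEFAULT_TIMEOUT) -> Tuple[str, Optional[int]]:
    """Run cmd with no usable stdin and a hard time limit.

    Returns ("ok" | "failed" | "timeout", return code or None).
    """
    try:
        proc = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,   # a prompt gets EOF instead of hanging
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout,            # run() kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return ("timeout", None)
    return ("ok" if proc.returncode == 0 else "failed", proc.returncode)
```

One caveat: `subprocess.run` only kills the direct child. A build that spawns its own children would likely need process-group handling on top of this, which the sketch leaves out.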

The Deployment Day

March 9: the script is ready and the system is ready. All that's left is to boot Gentoo and run it.

The actual run is scheduled for March 10. Today is the preparation, the testing, the "is everything in place?" check.

I also have to:

Restart RAG embeddings — the system reboot killed the background jobs. The embeddings for qwen06b and qwen8b models are only ~20% complete. Need to restart them in parallel, let them run overnight.
Colorado legal RAG — running autonomously on Titan CT 103. 614K chunks of case law indexed in 6+ hours. Still indexing federal statutes. Check the logs, make sure it's progressing, no manual intervention needed.
ArgoBox modules — two extractions to finish:
- Deployments module — refactored to 3-tier architecture (core logic + adapters + thin API routes). 40% code reduction. 8 files modified, 5 new files created.
- Argonaut module — the AI agent admin extracted into the same pattern. 27 files. Tier 1 core, Tier 2 adapters, thin Astro pages. API routes down 34% in lines.

All three of these things need to happen.

The Deployments Module

This one's mechanical but important. The old code had the same logic repeated across 4 API routes. Sync with Gitea, sync with Cloudflare, handle errors, return response. Same pattern, same error handling, same pitfalls everywhere.

Created:

lib/deployments-sync/gitea.ts — core Gitea API (clone repos, list branches, etc.)
lib/deployments-sync/cloudflare.ts — core CF API (list deployments, get build logs)
lib/deployments-sync/adapters/argobox.ts — auth adapter (credential lookup, token refresh)
lib/deployments-sync/types.ts — shared interfaces

Now the 4 API routes just call these functions. Less code. No duplication. Testable.
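
The shape of that split, roughly: core logic that takes its dependencies as arguments, an adapter that supplies credentials, and a route that only parses input and shapes the response. The real modules are TypeScript; this Python sketch only illustrates the layering. Every name in it is hypothetical, and the branches URL follows Gitea's REST API as I understand it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Protocol


# Tier 1: core logic, no framework imports. The HTTP call is injected.
def list_branches(fetch_json: Callable[[str, Dict[str, str]], list],
                  base_url: str, repo: str, token: str) -> List[str]:
    url = f"{base_url}/api/v1/repos/{repo}/branches"  # assumed endpoint shape
    data = fetch_json(url, {"Authorization": f"token {token}"})
    return [branch["name"] for branch in data]


# Tier 2: adapter that supplies credentials for a given host app.
class AuthAdapter(Protocol):
    def gitea_token(self) -> str: ...


@dataclass
class StaticAuthAdapter:
    token: str

    def gitea_token(self) -> str:
        return self.token


# Thin route: validate input, delegate to Tier 1, shape the response.
def branches_route(params: Dict[str, str], adapter: AuthAdapter,
                   fetch_json: Callable[[str, Dict[str, str]], list],
                   base_url: str) -> dict:
    repo = params.get("repo")
    if not repo:
        return {"status": 400, "body": {"error": "repo is required"}}
    try:
        names = list_branches(fetch_json, base_url, repo, adapter.gitea_token())
    except Exception as exc:
        return {"status": 502, "body": {"error": str(exc)}}
    return {"status": 200, "body": {"branches": names}}
```

Because the core function never touches the framework or the network directly, a test can pass in a fake `fetch_json` and a static adapter and exercise the whole path.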

Refactored the endpoints, committed the changes, and pushed to Gitea. The CF Pages build triggered immediately and should be live by tomorrow.

The Argonaut Module

Bigger scope. The Argonaut AI agent has:

Chat API (talk to the agent)
RAG search (query vector/text databases)
Task management (create, filter, sort tasks)
Voice profiles (manage AI voice config)
Admin audit (audit trail of agent actions)

Extracted all of it following the pattern from api-credentials and deployments:

Tier 1: core logic with zero framework imports
Tier 2: adapters for ArgoBox vs standalone
Thin API routes that delegate to Tier 1

The extraction uncovered two missing API routes — models.ts and voice-score.ts — that were referenced in the config but didn't exist as separate files. Now they do. Now they serve static data from Tier 1.

Line reductions:

chat.ts: 319 → 49 lines (85% reduction)
status.ts: 85 → 32 lines (62%)
tasks.ts: 59 → 30 lines (49%)
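
The percentages are easy to sanity-check from the before/after line counts. A quick sketch, with `reduction_pct` as a hypothetical helper:

```python
def reduction_pct(before: int, after: int) -> int:
    """Percentage of lines removed, rounded to the nearest whole number."""
    return round(100 * (before - after) / before)


# Before/after line counts from the list above.
ROUTES = {
    "chat.ts": (319, 49),
    "status.ts": (85, 32),
    "tasks.ts": (59, 30),
}

for name, (before, after) in ROUTES.items():
    print(f"{name}: {before} -> {after} lines ({reduction_pct(before, after)}%)")

total_before = sum(before for before, _ in ROUTES.values())
total_after = sum(after for _, after in ROUTES.values())
print(f"combined: {total_before} -> {total_after} lines "
      f"({reduction_pct(total_before, total_after)}%)")
```

For these three files together that works out to 463 lines down to 111, about a 76% reduction.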

Same functionality, way less surface area. Easier to test. Easier to port to other frameworks.

The Confluence

This is what a high-velocity day looks like: three workstreams, all moving in parallel.

Recovery infrastructure (the script): eight days and three sessions of design, review, and hardening, ready to deploy.
Long-running data processing (RAG embeddings, Colorado legal indexing): invisible, autonomous, just needs monitoring.
Module refactoring (deployments, argonaut): the kind of work that feels small until you ship it and realize how much cleaner the codebase is.

All of this happens in one day because the systems are designed to run in parallel. I don't wait for the embeddings to finish before extracting modules. I don't wait for the deployments refactor to finish before working on argonaut. I restart the embeddings, check the Colorado RAG logs, and write code while the GPU crunches vectors and the Titan server runs Python indexing jobs.

What's Next

March 10: boot Gentoo, run the recovery workflow, repair authentication support, reset access, and re-enable the display manager. If everything works, the system returns to service with logs explaining every step.

Also March 10 (probably): verify that the CF Pages build succeeded. Check that the deployments and argonaut modules are working correctly in the live site.

Also March 10 (probably): check embeddings progress. If they've completed, verify database integrity and start using them in search.

Also March 10 (probably): check Colorado legal RAG progress. If indexing is done, build the query API and test it.

It's the kind of day where you finish one thing and five other things are waiting. But that's what operational velocity looks like when you've got infrastructure that can run itself.

Tomorrow, everything gets fixed at once.

Or it finds the next constraint, and I have logs to explain exactly where the runbook needs to adapt.

Either way, I'll know.