Skip to main content
Back to Journal
user@argobox:~/journal/2026-02-07-25-bugs-one-night
$ cat entry.md

Build Swarm Stabilization Sprint

○ NOT REVIEWED

Build Swarm Stabilization Sprint

Session: Feb 7, 6:00 PM -- Feb 8, 6:30 AM Duration: 12.5 hours Bugs Fixed: 25+ Components Touched: 4 Coffee Consumed: Don't ask


18:00 -- The Audit Begins

Sat down planning to "quickly review" the codebase before a deploy. Opened the drone module first. Opened five more files. Opened the coordinator, the dashboard, the auto-balance logic.

By 18:30 I had six component reviews running in parallel and a growing list of edge cases worth fixing before rollout.

19:15 -- The Drone Bugs (7 total)

Seven bugs in the drone module alone. Most were edge cases around error handling and state transitions. But one was special.

The recursive retry bug was the most important one.

When a package build failed and the drone retried, the retry logic was appending --retry-after-fix to the package atom string. Every. Single. Retry.

Attempt 1: emerge -1 sys-libs/glibc
Attempt 2: emerge -1 sys-libs/glibc --retry-after-fix
Attempt 3: emerge -1 sys-libs/glibc --retry-after-fix --retry-after-fix
Attempt 4: emerge -1 sys-libs/glibc --retry-after-fix --retry-after-fix --retry-after-fix

Portage doesn't know what --retry-after-fix is. So every retry after the first was guaranteed to fail. Which triggered another retry. Which appended another flag. The string was growing unbounded until the drone gave up and grounded itself.

The fix was to keep retry metadata separate from the package atom so the command stays stable across attempts.

21:40 -- The Coordinator Bugs (8 total)

Eight bugs in the coordinator. State machine inconsistencies, a race condition where two drones could claim the same package if the coordinator was slow to update, and a nasty one where completed packages weren't being removed from the queue under specific failure-then-success sequences.

The race condition was the highest-risk item. In a 4-drone swarm, the odds of collision were low but nonzero. On a large queue run with multiple fast workers, it would have surfaced eventually.

23:30 -- Dashboard Overhaul

The dashboard had a WebSocket data overwrite bug. When multiple drones reported status within the same render cycle, the last one to report would overwrite the others. So if drone-Tarn reported idle and drone-Izar reported building, you'd only see drone-Izar's state.

Turned out the WebSocket handler was writing directly to shared state without merging. Fixed it with per-drone state buckets that merge on render.

01:15 -- Auto-Balance for Idle Drones

New feature instead of a fix. If a drone finishes its batch and the queue still has work, the coordinator now automatically assigns more. Before this, idle drones could remain idle while others were still grinding.

drone-Meridian has 24 cores. When it finishes fast and sits idle while drone-Tau-Beta (8 cores) is still chugging, that's wasted compute.

03:00 -- Deployment

Release marker updated. Built. Deployed to all four drones and orchestrator-Izar.

for host in drone-Izar drone-Tarn drone-Meridian drone-Tau-Beta orchestrator-Izar; do
    echo "Deploying stabilization build to $host..."
    # rsync, restart, verify
done

Every node updated. Every service restarted. Every health check green.

05:45 -- Running Clean

Kicked off a full queue run to verify. Watched the dashboard (now actually showing all drones correctly) as work flowed through.

Queue:     152 packages
Drones:    4 active, 0 idle, 0 grounded
Cores:     58 active
Status:    BUILDING

152 packages in the queue. 58 cores across 4 drones. All of them building. No corrupted atom strings. No race conditions. No overwritten dashboard state. No idle drones sitting around doing nothing.

06:30 -- Done

12.5 hours. 25+ bugs. 1 new feature. 4 components. 5 nodes deployed. Stabilization build out the door before sunrise.

The recursive retry bug was the worst one -- silent corruption that made every retry worse than the last. The kind of bug that looks like "intermittent build failures" in the logs until you read the actual emerge command and see a flag that shouldn't exist, appended four times.

I should sleep. The drones don't need to, but I do.


152 packages. 58 cores. Zero corrupted atom strings. The stabilization build earned its release marker.