The Stale Queue That Grounded My Drones
Date: 2026-02-05
Duration: 7+ hours
Issue: All drones idle despite 149 packages in queue
Root Cause: Version mismatch, stale queue, undeployed code, corrupt binpkgs
Fix: Version bump, force deploy, build-swarm fresh
The Symptom
149 packages needed building. Five drones available. All five showing IDLE.
drone-Izar IDLE 16 cores available
drone-Tarn IDLE 14 cores available
drone-Meridian IDLE 24 cores available
drone-Tau-Beta IDLE 8 cores available
sweeper-Capella IDLE 6 cores available
68 cores doing absolutely nothing while almost 150 packages sat in the queue collecting dust.
10:30 — First Theory: Coordinator Bug
Checked the coordinator logs. No errors. It was happily reporting “queue populated, dispatching work” — but the drones weren’t picking it up. Or more accurately, they were picking it up, failing, and getting grounded.
[orchestrator-Izar] drone-Izar: grounded after 3 consecutive failures
[orchestrator-Izar] drone-Tarn: grounded after 3 consecutive failures
[orchestrator-Izar] drone-Meridian: grounded after 3 consecutive failures
Every drone was attempting builds, failing immediately, and getting grounded by the orchestrator’s safety logic. Three strikes and you’re out.
12:15 — Second Theory: Package Issue
Pulled the build logs from drone-Izar. The emerge commands were failing because the packages referenced in the queue didn’t match what was available: the queue had version 6.22.0 packages, while the drones actually had 6.20.0.
The queue was stale. It had been generated against a newer portage tree, but the drones hadn’t been synced. Every package atom in the queue was either the wrong version or had dependency conflicts.
14:00 — The Real Problem: Undeployed Code
While digging, I found something worse. The code changes from the previous session — bug fixes, improvements, the works — had never been deployed to the remote nodes.
How did I miss this? The VERSION file was missing on the remotes. My deploy script checks VERSION to decide if an update is needed. No VERSION file means “up to date” (because the check returns “no difference”). Beautiful logic, Past Me.
ssh orchestrator-Izar "cat /opt/build-swarm/VERSION"
# cat: /opt/build-swarm/VERSION: No such file or directory
So the drones were running old code with a stale queue generated by new code. No wonder everything was failing.
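For context, here is a minimal sketch of how a diff-based freshness check can read a missing file as “no difference.” This is a reconstruction of the failure mode, not the actual deploy script; the paths and the `pushed_version` name are mine:

```shell
# Sketch only: reproduces the failure mode, not the real deploy script.
workdir=$(mktemp -d)
printf '0.4.3\n' > "$workdir/pushed_version"
# Note: $workdir/VERSION is never created, mimicking the missing remote file.

# Flawed check: empty diff *output* is taken to mean "no difference".
# With VERSION missing, diff writes its error to stderr (discarded here),
# so stdout stays empty and the node is wrongly judged current.
if [ -z "$(diff "$workdir/VERSION" "$workdir/pushed_version" 2>/dev/null)" ]; then
    echo "up to date"      # wrong: VERSION does not even exist
else
    echo "needs update"
fi
```

The trap is that diff distinguishes “files identical” from “file unreadable” only via stderr and exit code, neither of which this check looks at.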
15:45 — Corrupt Binary Packages
While I was in there, I found another surprise: corrupt binary packages for LLVM on the binhost. Three .xpak files were zero bytes. Any drone that tried to install LLVM from the binhost would fail, get retried, fail again, and get grounded.
LLVM is a dependency for a lot of packages. Those corrupt binpkgs were a hidden landmine.
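A quick way to sweep for this class of landmine is a zero-byte scan of the binhost’s package directory. The path below is an assumption (Gentoo’s default PKGDIR), not taken from my setup; point it at your actual binhost tree:

```shell
# List zero-byte files under the binhost package directory.
# PKGDIR is an assumption (Gentoo's default); override it for your binhost.
PKGDIR=${PKGDIR:-/var/cache/binpkgs}
[ -d "$PKGDIR" ] && find "$PKGDIR" -type f -size 0 -print
```

Anything this prints is a package no drone can ever install, so it belongs in a pre-flight check rather than a 7-hour debugging session.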
16:30 — The Fix
Three-part fix:
Version bump to 0.4.3 — new VERSION file, new deploy marker.
Force push to all nodes — skip the “is it already up to date?” check. Just push everything.
for host in orchestrator-Izar drone-Izar drone-Tarn drone-Meridian drone-Tau-Beta sweeper-Capella; do
rsync -avz --delete /opt/build-swarm/ "$host":/opt/build-swarm/
done
build-swarm fresh — nuclear option. Clears the stale queue, regenerates from current portage tree, rebuilds the corrupt binpkgs, and resets all drone grounding states.
[build-swarm] Clearing stale queue...
[build-swarm] Regenerating package list...
[build-swarm] Discovered 149 packages
[build-swarm] Resetting drone states...
[build-swarm] All drones cleared for duty
17:15 — Back Online
All five drones building. Queue draining. No grounding events.
The version mismatch was the root cause, but the missing VERSION file is what let it fester. If the deploy had worked correctly last session, the queue and the drones would have been in sync from the start.
Added a check: if VERSION doesn’t exist on a remote, treat it as “needs update” instead of “up to date.”
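A sketch of what that fail-safe check can look like, with the remote read stubbed as a local file path (the real script fetches VERSION over ssh; the function and argument names here are mine, not the actual code):

```shell
# Sketch: decide whether a node needs a deploy, failing safe on a missing file.
#   $1 = path to the node's VERSION file (fetched over ssh in the real script)
#   $2 = the version string the node should be running
needs_update() {
    # Missing or empty VERSION now means "needs update", never "up to date".
    [ ! -s "$1" ] && return 0
    [ "$(cat "$1")" != "$2" ]
}
```

With this in place, a node that has never seen a successful deploy (no VERSION file at all) falls into the “push everything” branch instead of being silently skipped.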
68 cores idle for 7 hours because a file didn’t exist. The drones weren’t broken — they were grounded by stale data and a deploy that never happened.