The Stale Queue That Grounded My Drones
Date: 2026-02-05
Duration: 7+ hours
Issue: All drones idle despite 149 packages in queue
Root Cause: Package-tree mismatch, stale queue, undeployed code, corrupt binpkgs
Fix: Deploy marker refresh, force deploy, build-swarm fresh
The Symptom
149 packages needed building. Five drones available. All five showing IDLE.
drone-Izar IDLE 16 cores available
drone-Tarn IDLE 14 cores available
drone-Meridian IDLE 24 cores available
drone-Tau-Beta IDLE 8 cores available
sweeper-Capella IDLE 6 cores available
68 cores doing absolutely nothing while almost 150 packages sat in the queue collecting dust.
10:30 -- First Theory: Coordinator Bug
Checked the coordinator logs. No errors. It was happily reporting "queue populated, dispatching work" -- but the drones weren't picking it up. Or more accurately, they were picking it up, failing, and getting grounded.
[orchestrator-Izar] drone-Izar: grounded after 3 consecutive failures
[orchestrator-Izar] drone-Tarn: grounded after 3 consecutive failures
[orchestrator-Izar] drone-Meridian: grounded after 3 consecutive failures
Every drone was attempting builds, failing immediately, and getting grounded by the orchestrator's safety logic. Three strikes and you're out.
12:15 -- Second Theory: Package Issue
Pulled the build logs from drone-Izar. The emerge commands were failing because the packages referenced in the queue didn't match what was available. The queue had been generated from a newer package tree than the drones were using.
The queue was stale. It had been generated against a newer portage tree, but the drones hadn't been synced. Every package atom in the queue was either mismatched or had dependency conflicts.
14:00 -- The Real Problem: Undeployed Code
While digging, I found the real issue. The code changes from the previous session had not been deployed to the remote nodes.
The deploy marker was missing on the remotes. My deploy script checks that marker to decide if an update is needed. No marker was being treated as "up to date" because the check returned "no difference."
ssh orchestrator-Izar "cat /opt/build-swarm/DEPLOY_MARKER"
# cat: /opt/build-swarm/DEPLOY_MARKER: No such file or directory
So the drones were running old code with a stale queue generated by new code. The deploy check needed to fail closed.
15:45 -- Corrupt Binary Packages
While I was in there, I found corrupt binary packages for LLVM on the binhost. Three .xpak files were zero-byte. Any drone that tried to install LLVM from the binhost would fail, get retried, fail again, and get grounded.
LLVM is a dependency for a lot of packages. Those corrupt binpkgs were a hidden landmine.
16:30 -- The Fix
Three-part fix:
Deploy marker refresh -- new deploy marker so remotes can fail closed when they are missing metadata.
Force push to all nodes -- skip the "is it already up to date?" check. Just push everything.
for host in orchestrator-Izar drone-Izar drone-Tarn drone-Meridian drone-Tau-Beta sweeper-Capella; do
rsync -avz --delete /opt/build-swarm/ $host:/opt/build-swarm/
done
build-swarm fresh -- nuclear option. Clears the stale queue, regenerates from current portage tree, rebuilds the corrupt binpkgs, and resets all drone grounding states.
[build-swarm] Clearing stale queue...
[build-swarm] Regenerating package list...
[build-swarm] Discovered 149 packages
[build-swarm] Resetting drone states...
[build-swarm] All drones cleared for duty
17:15 -- Back Online
All five drones building. Queue draining. No grounding events.
The package-tree mismatch was the root cause, but the missing deploy marker is what let it persist. The queue and drones need to be validated as a matched set before a distributed run.
Added a check: if the deploy marker doesn't exist on a remote, treat it as "needs update" instead of "up to date."
68 cores idle because a deploy marker was missing. The drones weren't broken; they were grounded by stale data and a deploy check that needed to fail closed.