Skip to main content
Back to Journal
user@argobox:~/journal/2026-02-05-stale-queue-grounding
$ cat entry.md

The Stale Queue That Grounded My Drones

○ NOT REVIEWED

The Stale Queue That Grounded My Drones

Date: 2026-02-05 Duration: 7+ hours Issue: All drones idle despite 149 packages in queue Root Cause: Package-tree mismatch, stale queue, undeployed code, corrupt binpkgs Fix: Deploy marker refresh, force deploy, build-swarm fresh


The Symptom

149 packages needed building. Five drones available. All five showing IDLE.

drone-Izar       IDLE    16 cores available
drone-Tarn       IDLE    14 cores available
drone-Meridian   IDLE    24 cores available
drone-Tau-Beta   IDLE     8 cores available
sweeper-Capella  IDLE     6 cores available

68 cores doing absolutely nothing while almost 150 packages sat in the queue collecting dust.

10:30 -- First Theory: Coordinator Bug

Checked the coordinator logs. No errors. It was happily reporting "queue populated, dispatching work" -- but the drones weren't picking it up. Or more accurately, they were picking it up, failing, and getting grounded.

[orchestrator-Izar] drone-Izar: grounded after 3 consecutive failures
[orchestrator-Izar] drone-Tarn: grounded after 3 consecutive failures
[orchestrator-Izar] drone-Meridian: grounded after 3 consecutive failures

Every drone was attempting builds, failing immediately, and getting grounded by the orchestrator's safety logic. Three strikes and you're out.

12:15 -- Second Theory: Package Issue

Pulled the build logs from drone-Izar. The emerge commands were failing because the packages referenced in the queue didn't match what was available. The queue had been generated from a newer package tree than the drones were using.

The queue was stale. It had been generated against a newer portage tree, but the drones hadn't been synced. Every package atom in the queue was either mismatched or had dependency conflicts.

14:00 -- The Real Problem: Undeployed Code

While digging, I found the real issue. The code changes from the previous session had not been deployed to the remote nodes.

The deploy marker was missing on the remotes. My deploy script checks that marker to decide if an update is needed. No marker was being treated as "up to date" because the check returned "no difference."

ssh orchestrator-Izar "cat /opt/build-swarm/DEPLOY_MARKER"
# cat: /opt/build-swarm/DEPLOY_MARKER: No such file or directory

So the drones were running old code with a stale queue generated by new code. The deploy check needed to fail closed.

15:45 -- Corrupt Binary Packages

While I was in there, I found corrupt binary packages for LLVM on the binhost. Three .xpak files were zero-byte. Any drone that tried to install LLVM from the binhost would fail, get retried, fail again, and get grounded.

LLVM is a dependency for a lot of packages. Those corrupt binpkgs were a hidden landmine.

16:30 -- The Fix

Three-part fix:

Deploy marker refresh -- new deploy marker so remotes can fail closed when they are missing metadata.

Force push to all nodes -- skip the "is it already up to date?" check. Just push everything.

for host in orchestrator-Izar drone-Izar drone-Tarn drone-Meridian drone-Tau-Beta sweeper-Capella; do
    rsync -avz --delete /opt/build-swarm/ $host:/opt/build-swarm/
done

build-swarm fresh -- nuclear option. Clears the stale queue, regenerates from current portage tree, rebuilds the corrupt binpkgs, and resets all drone grounding states.

[build-swarm] Clearing stale queue...
[build-swarm] Regenerating package list...
[build-swarm] Discovered 149 packages
[build-swarm] Resetting drone states...
[build-swarm] All drones cleared for duty

17:15 -- Back Online

All five drones building. Queue draining. No grounding events.

The package-tree mismatch was the root cause, but the missing deploy marker is what let it persist. The queue and drones need to be validated as a matched set before a distributed run.

Added a check: if the deploy marker doesn't exist on a remote, treat it as "needs update" instead of "up to date."


68 cores idle because a deploy marker was missing. The drones weren't broken; they were grounded by stale data and a deploy check that needed to fail closed.