Build Swarm Validation and Container Runtime Cleanup

Date: 2026-01-23
Session: Major swarm overhaul
Bugs Fixed: 4
New Scripts: 3


The Missing Binaries Mystery

Packages were building successfully. Drones were reporting completion. But the orchestrator's validation pass kept rejecting some of them, and by the time a retry came around, the drone no longer had the binary to send.

The orchestrator logs showed validation failures for packages that definitely built. What was happening?


Bug #1: The rsync Delete Flag

Found it in the drone code:

rsync_cmd = [
    'rsync', '-av', '--remove-source-files',  # 💀
    f'{pkgdir}/',
    f'root@{orchestrator_ip}:{staging_path}/'
]

--remove-source-files deletes the local file immediately after upload. That is efficient, but it is the wrong default for a validation-first build pipeline.

The problem: the orchestrator runs validation after receiving the file. If validation fails for any reason (wrong checksum, path mismatch, whatever), the orchestrator rejects the package. But the drone had already removed its local copy, so the retry path had no source artifact to use.

The fix: Don't delete until the orchestrator confirms acceptance.

import subprocess

# pkgdir and staging_path are assumed to be defined elsewhere in the drone code

def upload_binary(orchestrator_ip, package):
    # rsync WITHOUT --remove-source-files, so the local copy survives the upload
    rsync_cmd = [
        'rsync', '-av', '--ignore-existing',
        f'{pkgdir}/',
        f'root@{orchestrator_ip}:{staging_path}/'
    ]
    subprocess.check_call(rsync_cmd)  # raises CalledProcessError if rsync fails

def cleanup_local_binaries(package):
    """Only called after orchestrator confirms success."""
    # Now it's safe to delete

The orchestrator's completion response now includes an accepted field:

self.send_json({'status': 'ok', 'accepted': accepted})

Drones only clean up when accepted: true.
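
The drone-side half of that handshake might look roughly like the sketch below. It assumes the completion response arrives as a JSON body with the status and accepted fields shown above; should_cleanup, finish_package, and the report_completion callable are hypothetical names, not the swarm's actual code.

```python
import json
from typing import Callable


def should_cleanup(response_body: str) -> bool:
    """Return True only if the orchestrator explicitly accepted the package."""
    try:
        reply = json.loads(response_body)
    except json.JSONDecodeError:
        # An unreadable reply is treated as "not accepted": keep the binary.
        return False
    if not isinstance(reply, dict):
        return False
    return reply.get("status") == "ok" and reply.get("accepted") is True


def finish_package(
    package: str,
    report_completion: Callable[[str], str],
    cleanup_local_binaries: Callable[[str], None],
) -> bool:
    """Report a finished build and delete local files only on acceptance."""
    body = report_completion(package)  # hypothetical: returns the raw JSON reply
    if should_cleanup(body):
        cleanup_local_binaries(package)
        return True
    return False  # local copy stays available for a retry
```

Anything short of an explicit accepted: true, including a malformed reply, leaves the local binary in place, which is the safe default when the whole point is to have something left to retry with.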


Bug #2: Global Variable in Threading

The heartbeat worker thread was throwing UnboundLocalError:

UnboundLocalError: local variable 'current_package' referenced before assignment

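Classic Python scoping mistake, and not really a threading one. Threads share module-level globals, so reading them from a worker is fine. But heartbeat_worker also assigns to current_package (and presumably build_start_time) in the part elided below, and any assignment makes a name local to the entire function unless it is declared global. The read at the top of the loop then hits a local that has no value yet.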

# BEFORE (broken)
def heartbeat_worker():
    while True:
        phone_home()
        if current_package and build_start_time > 0:  # 💥
            # ...

# AFTER (fixed)
def heartbeat_worker():
    global current_package, build_start_time  # Added this
    while True:
        phone_home()
        if current_package and build_start_time > 0:
            # ...

One line. Hours of debugging.


Bug #3: Container Runtime Assumptions

A drone container was still using an older startup path from a previous SSH-oriented architecture. Logs showed:

Error: SSH key not found at /root/.ssh/id_rsa

That entrypoint dates from when drones took their instructions over SSH. The current architecture uses HTTP for control-plane coordination, so the container needed to be brought back in line with the runtime model.

The fix: recreate the container so it runs the current HTTP-based drone:

docker run -d \
    --name drone-Meridian \
    --hostname drone-Meridian \
    --network host \
    --restart unless-stopped \
    -e GATEWAY_URL=http://gateway.internal:8090 \
    -v drone-swarm-code:/opt/build-swarm:ro \
    -v drone-Meridian-portage:/var/db/repos/gentoo \
    -v /root/.ssh/id_ed25519:/root/.ssh/id_rsa:ro \
    --entrypoint /bin/bash \
    gentoo-drone:current \
    -c '
        export PYTHONPATH=/opt/build-swarm/lib:$PYTHONPATH
        exec python3 /opt/build-swarm/bin/swarm-drone
    '

Key changes:

  • Use the host network deliberately for the lab runtime model
  • Override entrypoint to run Python drone directly
  • Mount the SSH key read-only and keep it scoped to file uploads, not control traffic

Bug #4: Binary Path Validation

This one's still partially open. The orchestrator's binary validation is looking for files in the wrong path:

Looking for: /var/cache/binpkgs/sys-libs/libseccomp-release/libseccomp-package.gpkg.tar
Should be:   /var/cache/binpkgs/sys-libs/libseccomp/libseccomp-package.gpkg.tar

The package release identifier shouldn't be in the directory path. This causes "missing_binary" errors for packages that actually built successfully. Added to the fix list.
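
Until that fix lands, the intended layout is easy to pin down in a few lines. This is only a sketch: binpkg_path and buggy_binpkg_path are hypothetical helpers, not the orchestrator's actual validation code, and the file name is passed through untouched.

```python
from pathlib import PurePosixPath

BINPKG_ROOT = PurePosixPath("/var/cache/binpkgs")


def binpkg_path(category: str, name: str, filename: str) -> PurePosixPath:
    """Expected location: <root>/<category>/<name>/<filename>."""
    return BINPKG_ROOT / category / name / filename


def buggy_binpkg_path(
    category: str, name: str, release: str, filename: str
) -> PurePosixPath:
    """What validation effectively does now: the release leaks into the directory."""
    return BINPKG_ROOT / category / f"{name}-{release}" / filename
```

With the values from the log lines above, binpkg_path("sys-libs", "libseccomp", "libseccomp-package.gpkg.tar") gives the "Should be" path, and the buggy variant reproduces the "Looking for" path.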


New Automation Scripts

Got tired of manually setting up nodes. Wrote scripts:

setup-drone.sh:

./scripts/setup-drone.sh drone-host.internal drone-Tau-Beta

What it does:

  1. Fixes hostname resolution
  2. Creates directories
  3. Deploys drone code
  4. Creates OpenRC init script
  5. Configures sleep prevention (no hibernating mid-build)
  6. Syncs portage tree
  7. Enables and starts service
  8. Verifies gateway registration

Also wrote setup-orchestrator.sh and setup-gateway.sh for the other components.
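
Step 8 could be checked with something like the sketch below. The endpoint is the one this entry uses for status checks; the response shape (a JSON list of objects with a name key) is an assumption, and fetch_nodes and is_registered are hypothetical names rather than what setup-drone.sh actually runs.

```python
import json
import urllib.request


def fetch_nodes(gateway_url: str, timeout: float = 5.0):
    """Fetch the gateway's node list and return the parsed JSON."""
    with urllib.request.urlopen(f"{gateway_url}/api/current/nodes", timeout=timeout) as resp:
        return json.load(resp)


def is_registered(nodes, drone_name: str) -> bool:
    """True if drone_name appears in the node list (assumed schema: [{'name': ...}, ...])."""
    if not isinstance(nodes, list):
        return False  # unexpected shape: treat as not registered
    return any(
        isinstance(node, dict) and node.get("name") == drone_name
        for node in nodes
    )
```

A setup script would call is_registered(fetch_nodes("http://gateway.internal:8090"), "drone-Tau-Beta") and fail loudly if it returns False.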


LXC Container Conversion

Converted drone-Tau-Beta from bare-metal to an LXC container. Better isolation, easier to reset if something goes wrong.

Container config:

lxc.net.0.type = macvlan
lxc.net.0.macvlan.mode = bridge
lxc.net.0.link = eno1
lxc.cgroup2.cpuset.cpus = 0-5
lxc.cgroup2.memory.max = 24G
lxc.start.auto = 1

6 cores, 24GB RAM, auto-starts on boot. Gets its own IP via DHCP on the macvlan bridge.


Sleep Prevention

Drones kept going to sleep mid-build on systems with power management. Added elogind config:

# /etc/elogind/logind.conf.d/no-sleep.conf
[Login]
HandlePowerKey=ignore
HandleSuspendKey=ignore
HandleHibernateKey=ignore
HandleLidSwitch=ignore
IdleAction=ignore
IdleActionSec=infinity

No more surprise naps.


Current Swarm Status

After all the fixes:

Component      Count  Status
Gateway        1      ✅ Online
Orchestrators  2      ✅ Both online
Drones         3      ✅ All building
Total Cores    46     Active

Test build results:

  • 30 packages queued
  • 23 successful (about 77%)
  • 7 blocked (mostly path validation bug + nvidia-drivers needing kernel sources)

Deployment Commands (Reference)

# Deploy drone code to all drones
for drone in drone-Meridian drone-Izar drone-Tau-Beta; do
  scp bin/swarm-drone "root@$(build-swarm "$drone" ip):/opt/build-swarm/bin/"
  build-swarm "$drone" restart
done

# Setup a new drone
./scripts/setup-drone.sh <ip> [name]

# Check swarm status
curl -s http://gateway.internal:8090/api/current/nodes | python3 -m json.tool

Four bugs. One race condition eating binaries. One Docker container from a different era. One missing global statement. One path validation issue still pending.

The swarm is at 46 cores now. Almost broke 50, but one of the drones kept going to sleep. Fixed that too.