Managing a distributed build system is fun until a drone goes rogue and fails the same build 500 times in a minute. This is the story of a week spent making the swarm actually production-ready.

Saturday Morning: The Cron Job That Wasn’t

Saturday, 8 AM. No builds had run overnight. The queue was full of needed packages, but nothing was moving. All drones showed “Online” but sat idle.

The binhost server runs Gentoo with dcron. When I’d updated the crontab on Friday, I used cronie-style syntax:

# This works in cronie
0 2 * * * /opt/build-swarm/scripts/nightly-update.sh

But dcron doesn’t accept the same format. Instead of reporting an error, it silently failed to load the entry.

The fix: Switched the binhost to cronie for consistency. Added a health check that verifies cron jobs are actually scheduled. The nightly update script now logs “STARTED” at the beginning so we know it ran.
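
The health check is small enough to sketch. A minimal version, assuming the log path and the 26-hour window (my choices here, not the actual script):

import pathlib
import subprocess
import sys
import time

EXPECTED = "/opt/build-swarm/scripts/nightly-update.sh"        # entry that must be scheduled
LOG = pathlib.Path("/var/log/build-swarm/nightly-update.log")  # assumed log location

def cron_entry_present() -> bool:
    # `crontab -l` prints the active crontab; a missing entry means it never loaded
    out = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    return EXPECTED in out.stdout

def ran_recently(max_age_hours: float = 26.0) -> bool:
    # the script logs "STARTED" on entry, so a stale or empty log means it never fired
    if not LOG.exists() or "STARTED" not in LOG.read_text():
        return False
    return time.time() - LOG.stat().st_mtime < max_age_hours * 3600

if __name__ == "__main__":
    if not cron_entry_present():
        sys.exit("nightly-update.sh is not in the active crontab")
    if not ran_recently():
        sys.exit("nightly-update.sh has not logged STARTED in the last 26 hours")
    print("cron OK")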

Lesson: “Silent failure” is the worst failure mode. Test cron changes immediately.


The Runaway Drone Problem

While debugging the cron issue, I noticed something worse: drone-Tarn was in a retry loop of death. It would claim jobs from the queue, fail them instantly due to a configuration issue, then immediately claim them again. This cycle blocked the entire build queue.

The “split brain” problem was real too: the orchestrator thought a drone was working when the drone had actually crashed, so jobs would sit in the delegated state for hours.

The Solution: Circuit Breakers

Borrowed from microservices architecture:

Three Strikes Rule (well, Five): If a drone reports 5 consecutive failures, it gets “Grounded” — no new work for 5 minutes.

Auto-Reclaim: When a drone is grounded, any work delegated to it immediately goes back in the needed queue.

Maintenance Loop: Runs every 60 seconds, sweeps for offline drones, reclaims their work.

Auto-Reboot: If enabled, the orchestrator can SSH into a grounded drone and restart it:

ssh drone-Izar "rc-service build-drone restart"
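
The grounding logic itself fits in a small class. A minimal sketch (names and structure are mine, not the actual codebase):

import time
from dataclasses import dataclass

FAILURE_LIMIT = 5      # consecutive failures before grounding
GROUND_SECONDS = 300   # the five-minute cooldown

@dataclass
class DroneBreaker:
    consecutive_failures: int = 0
    grounded_until: float = 0.0

    def record_success(self) -> None:
        # any success resets the strike counter
        self.consecutive_failures = 0

    def record_failure(self) -> None:
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_LIMIT:
            # ground the drone: no new work until the cooldown expires
            self.grounded_until = time.time() + GROUND_SECONDS
            self.consecutive_failures = 0

    def available(self) -> bool:
        return time.time() >= self.grounded_until

The orchestrator checks available() before delegating, so a drone stuck in a retry loop burns through at most five jobs before it loses access to the queue.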

Upgrading the Monitor

The original build-swarm monitor was ugly. Fixed-width columns that broke on long package names. No colors. No resource tracking. Just “Online” or “Offline”.

Resource Tracking

Drones now report resource usage in their heartbeat:

{
  "drone_id": "drone-Izar",
  "status": "building",
  "current_task": "sys-devel/llvm-19.1.7",
  "cpu_percent": 87.3,
  "memory_percent": 45.2
}
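
The sending side is a loop around a JSON POST. A sketch, assuming psutil for the resource numbers and a made-up orchestrator endpoint:

import json
import time
import urllib.request

import psutil  # third-party: CPU and memory sampling

ORCHESTRATOR = "http://orchestrator:8080/heartbeat"  # assumed endpoint

def send_heartbeat(drone_id, status, task):
    payload = {
        "drone_id": drone_id,
        "status": status,
        "current_task": task,
        "cpu_percent": psutil.cpu_percent(interval=1),      # averaged over one second
        "memory_percent": psutil.virtual_memory().percent,  # system-wide RAM use
    }
    req = urllib.request.Request(
        ORCHESTRATOR,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

while True:
    # sample values; the real loop reads the drone's current state
    send_heartbeat("drone-Izar", "building", "sys-devel/llvm-19.1.7")
    time.sleep(15)  # heartbeat interval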

Visual Improvements

The monitor now uses Python’s rich library:

┌─────────────────────────────────────────────────────────────────┐
│ Build Swarm Monitor          Needed: 45  Built: 312  Rate: 8/hr │
├─────────────────────────────────────────────────────────────────┤
│ DRONE          │ STATUS   │ TASK                  │ CPU  │ RAM  │
├────────────────┼──────────┼───────────────────────┼──────┼──────┤
│ drone-Izar     │ BUILDING │ sys-devel/llvm-19.1.7 │ 87%  │ 45%  │
│ drone-Tarn     │ BUILDING │ dev-qt/qtbase-6.8.2   │ 92%  │ 38%  │
│ drone-Tau-Beta │ IDLE     │ -                     │ 2%   │ 12%  │
│ drone-Meridian │ BUILDING │ www-client/firefox    │ 95%  │ 62%  │
└─────────────────────────────────────────────────────────────────┘
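
The rendering loop is short with rich. A condensed sketch; fetch_drones() is a stand-in that returns sample rows here:

import time

from rich.live import Live
from rich.table import Table

STATUS_STYLE = {"BUILDING": "green", "IDLE": "yellow", "GROUNDED": "red"}

def fetch_drones():
    # stand-in for reading the latest heartbeats from the orchestrator
    return [
        {"drone_id": "drone-Izar", "status": "BUILDING",
         "current_task": "sys-devel/llvm-19.1.7", "cpu_percent": 87.3, "memory_percent": 45.2},
        {"drone_id": "drone-Tau-Beta", "status": "IDLE",
         "current_task": None, "cpu_percent": 2.0, "memory_percent": 12.0},
    ]

def render(drones):
    table = Table(title="Build Swarm Monitor")
    for col in ("DRONE", "STATUS", "TASK", "CPU", "RAM"):
        table.add_column(col)
    for d in drones:
        style = STATUS_STYLE.get(d["status"], "white")
        table.add_row(
            d["drone_id"],
            f"[{style}]{d['status']}[/{style}]",  # color-coded status cell
            d.get("current_task") or "-",
            f"{d['cpu_percent']:.0f}%",
            f"{d['memory_percent']:.0f}%",
        )
    return table

# Live redraws the table in place instead of scrolling the terminal
with Live(render(fetch_drones()), refresh_per_second=4) as live:
    while True:
        time.sleep(1)
        live.update(render(fetch_drones()))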

Green for building, yellow for idle, red for grounded. Build watching became a spectator sport.


Code Review: Finding the Gremlins

I took a systematic pass through both the apkg and build-swarm codebases.

apkg Issues

  • Hardcoded paths — Several scripts embedded absolute paths instead of deriving them from $HOME
  • Missing error handling — SSH failures to binhost weren’t caught gracefully
  • Duplicate code — Package resolution logic was copied in three places

Fixed by moving to XDG paths, adding retry logic with exponential backoff, and creating a PackageResolver class.
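
The retry wrapper is the piece worth showing. A sketch with illustrative timeouts and delays:

import subprocess
import time

def ssh_with_retry(host, command, attempts=4):
    # exponential backoff: wait 1s, 2s, 4s between failed attempts
    for attempt in range(attempts):
        result = subprocess.run(
            ["ssh", host, command], capture_output=True, text=True, timeout=60
        )
        if result.returncode == 0:
            return result.stdout
        if attempt < attempts - 1:
            time.sleep(2 ** attempt)
    raise RuntimeError(f"ssh {host} failed after {attempts} attempts: {result.stderr.strip()}")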

Build Swarm Issues

  • Race condition — Two drones could claim the same job if requests hit within milliseconds
  • Memory leak — Build logs accumulated in memory without cleanup
  • No graceful shutdown — Killing a drone mid-build left orphaned jobs

Fixed with Redis-based locking for job claims, log rotation with configurable retention, and a SIGTERM handler that finishes the current build before exit.
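
The claim lock is the interesting one. A minimal sketch with redis-py; key names and TTL are illustrative:

import redis

r = redis.Redis(host="orchestrator", port=6379)  # assumed Redis location

def claim_job(job_id, drone_id, ttl=3600):
    # SET NX EX is atomic: exactly one caller can create the key, and the
    # TTL guarantees a claim expires if the drone dies mid-build
    return bool(r.set(f"claim:{job_id}", drone_id, nx=True, ex=ttl))

def release_job(job_id, drone_id):
    # only the claiming drone releases its own key
    if r.get(f"claim:{job_id}") == drone_id.encode():
        r.delete(f"claim:{job_id}")

The get-then-delete in release_job still has a tiny race of its own; a compare-and-delete Lua script closes it, but for per-job keys the simple version goes a long way.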

Test coverage: 67% → 84%


Self-Healing Goes Live

Built three levels of automatic recovery:

Level 1: Job Reclamation — When a drone goes offline, its delegated work is automatically reclaimed after 60 seconds.

Level 2: Drone Restart — If a drone is grounded (too many failures), the orchestrator can SSH in and restart the service.

Level 3: Full Reboot — For hardware-level issues, the orchestrator can trigger a full system reboot. Gated behind confirmation and only used after multiple restart attempts fail.
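
Glued together, the escalation path looks roughly like this; reclaim_jobs is a placeholder and the thresholds are illustrative:

import subprocess

def reclaim_jobs(drone):
    # placeholder: move the drone's delegated jobs back to the needed queue
    ...

def restart_service(drone):
    # Level 2: restart the drone service over SSH (OpenRC on Gentoo)
    r = subprocess.run(["ssh", drone, "rc-service build-drone restart"], timeout=120)
    return r.returncode == 0

def recover(drone, restart_attempts=3, allow_reboot=False):
    reclaim_jobs(drone)  # Level 1: get its work back into circulation first
    for _ in range(restart_attempts):
        if restart_service(drone):
            return
    if allow_reboot:
        # Level 3: only after repeated restarts fail, and only if explicitly enabled
        subprocess.run(["ssh", drone, "reboot"], timeout=30)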

The Test

The same day I deployed self-healing, the Andromeda network had a power event. Two drones went offline simultaneously.

What happened:

  • 15:42 — Power blip at Andromeda site
  • 15:43 — drone-Tarn, drone-Meridian offline
  • 15:44 — Self-healing kicked in, reclaimed 12 jobs
  • 15:45 — Remaining drones picked up the work
  • 16:10 — Andromeda drones came back online
  • 16:11 — Automatic reintegration into swarm

What would have happened before: Jobs stuck for hours. Manual intervention required. Angry developer.


Automation Polish

Portage Sync Tolerance

The nightly sync occasionally failed due to mirror timeouts. Added retry logic with fallback:

# retry once after a pause, then fall back to fetching a snapshot over HTTP
emerge --sync || { sleep 60; emerge --sync; } || emerge-webrsync

Binary Validation

Found that network hiccups during upload could result in truncated binaries. Drones now:

  1. Build the package
  2. Calculate SHA256
  3. Upload binary + checksum
  4. Binhost verifies before accepting
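
The binhost-side check (step 4) is a few lines of hashlib. A sketch, assuming a sha256sum-style “HASH  filename” sidecar file:

import hashlib
import pathlib

def sha256_of(path):
    h = hashlib.sha256()
    with path.open("rb") as f:
        # stream in 1 MiB chunks so large binpkgs never load fully into RAM
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def accept_upload(binary: pathlib.Path, checksum_file: pathlib.Path) -> bool:
    expected = checksum_file.read_text().split()[0]
    if sha256_of(binary) != expected:
        binary.unlink()  # reject truncated or corrupted uploads
        return False
    return True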

Better Status Reporting

The apkg status command now shows everything at a glance:

$ apkg status
Sync: 2026-01-23 02:00 (12 hours ago)
World: 264 packages
Binary: 98.2% available (4,637/4,722)
Swarm: 4/5 drones online, 12 packages building

Before and After

Before:

  • Drone fails → Retry loop of death
  • Drone offline → Job stuck for 3 hours
  • Observability → “It’s probably working”

After:

  • Drone fails → Grounded → Reboots → Recovers
  • Drone offline → Work reclaimed in 60s
  • Observability → “drone-Izar is building mesa-25.3.3 at 87% CPU”

The swarm went from “needs babysitting” to “runs itself.” That’s the difference between a prototype and production.

The swarm is watching itself now.