
The Drone That Rebooted the Wrong Server


Date: 2026-01-27
Issue: Gateway rebooting 4+ times per day; multiple drones offline
Root Cause: NAT masquerade + auto-heal = friendly fire
Result: Swarm restored to 58 cores across 4 active drones


The Mystery

Alpha-Centauri (the gateway server at 10.42.0.199) was rebooting randomly. Four times on January 26th alone. No pattern I could find.

  • No kernel panics
  • No OOM kills
  • No hardware errors
  • NVMe temperature normal (46°C)

Every reboot was a clean shutdown. Like someone typed reboot.

I checked the auth logs:

Jan 26 14:23:47 Alpha-Centauri sshd: Accepted publickey for root from 100.64.0.18
Jan 26 14:23:49 Alpha-Centauri systemd: Stopping all remaining mounts...

Someone SSH'd in from 100.64.0.18 — which is Izar-Orchestrator — and the system shut down two seconds later.

The orchestrator was rebooting my gateway. But why?


The Full Damage Assessment

After Alpha-Centauri came back online, the swarm was in rough shape:

Drone                          | Status               | Problem
drone-Tau-Beta (10.42.0.194)   | Offline              | Service stopped
drone-Meridian (192.168.20.77) | Blocked              | Hardcoded block in gateway code
drone-Tarn (192.168.20.196)    | Online but dangerous | See below
drone-Izar (10.42.0.203)       | Online               | Working fine

The gateway itself wasn't starting on boot — systemctl enable swarm-gateway had never been run.


Fix #1: Gateway Service Persistence

ssh root@10.42.0.199 'systemctl start swarm-gateway && systemctl enable swarm-gateway'

Now it survives reboots. Novel concept.

Fix #2: drone-Tau-Beta (The Easy One)

Just stopped. SSH'd in, started it:

ssh root@10.42.0.194 'rc-service swarm-drone start'

8 cores back in the pool.

Fix #3: drone-Meridian (The Blocked One)

This drone lives on the remote network (192.168.20.x) and had been causing trouble previously. Past Me had helpfully added a hardcoded block:

# Temporary Block: drone-Meridian (failed uploads, unreachable SSH)
if 'meridian' in node_data.get('name', '').lower():
    log.warning(f"Registration rejected for BLOCKED node")
    return {'error': 'Node blocked pending maintenance'}

I forgot to remove it after fixing the actual issue. Classic.
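One way to keep a workaround like that from outliving its purpose is to give the block an expiry date, so it removes itself even when I forget. A minimal sketch, assuming a structure like the gateway's name check above (`BLOCKED_NODES` and `check_block` are hypothetical names, not the gateway's actual API):

```python
from datetime import datetime, timedelta

# Hypothetical replacement for the hardcoded name check: each entry
# carries a reason and an expiry, so stale blocks disappear on their own.
BLOCKED_NODES = {
    'meridian': {
        'reason': 'failed uploads, unreachable SSH',
        'expires': datetime(2026, 1, 25),  # block dies even if I forget it
    },
}

def check_block(node_name: str):
    """Return the active block entry for a node, or None if unblocked."""
    entry = BLOCKED_NODES.get(node_name.lower())
    if entry is None:
        return None
    if datetime.now() >= entry['expires']:
        return None  # expired: treat the node as unblocked
    return entry
```

The registration handler would then reject only when `check_block()` returns a live entry, instead of matching on a name forever.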

Removed the block, then configured the drone to use Tailscale IPs for everything:

GATEWAY_URL="http://100.64.0.88:8090"
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.110"

Also had to patch the drone code to actually use these variables:

# Changed from:
orch_config = {'ip': None, 'port': 8080}

# To (needs `import os` at the top of the module):
orch_config = {
    'ip': os.environ.get('ORCHESTRATOR_IP'),
    'port': int(os.environ.get('ORCHESTRATOR_PORT', 8080)),
}
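Since the silent failure mode here was `ip: None`, it may be worth failing fast when the variable is missing instead of registering with a bogus address. A sketch of that idea (`load_orch_config` is my name for it, not the drone's real code):

```python
import os

def load_orch_config(env=os.environ):
    """Build the orchestrator config, refusing to start with ip=None."""
    ip = env.get('ORCHESTRATOR_IP')
    if not ip:
        # Better to die loudly at startup than come up half-configured.
        raise RuntimeError('ORCHESTRATOR_IP is not set; refusing to start')
    return {'ip': ip, 'port': int(env.get('ORCHESTRATOR_PORT', 8080))}
```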

20 cores back in the pool.


Fix #4: drone-Tarn (The Mystery Rebooter)

Now for the fun one.

The build swarm has an auto-reboot feature: when a drone fails too many builds in a row, the orchestrator SSHs to it and runs reboot. Clean slate, start fresh.

Here's the bug: drone-Tarn (192.168.20.196) was configured to reach the gateway at http://10.42.0.199:8090. That IP isn't directly routable from the remote network. Traffic goes through Tailscale subnet routing, which means it gets NAT'd (masqueraded) through Alpha-Centauri.

The orchestrator sees the source IP of drone-Tarn's traffic as... 10.42.0.199. The gateway's IP. Because that's where the NAT is happening.

When drone-Tarn failed builds and the orchestrator tried to reboot it:

# What the orchestrator thought it was doing:
ssh root@10.42.0.199 'reboot'  # Reboot drone-Tarn

# What it actually did:
ssh root@10.42.0.199 'reboot'  # Reboot the gateway

Same IP. Different intended targets.

The gateway was getting rebooted because a drone on another network was failing builds.
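The failure mode is easy to reproduce in miniature: if the orchestrator keys nodes by the source IP it observes, every NAT'd drone collapses onto the masquerade address. A toy model of that logic (not the orchestrator's real code, just the shape of the bug):

```python
# Toy model: the orchestrator records one entry per observed source IP.
# Behind NAT, a remote drone "arrives" from the masquerade address, so
# its registration silently overwrites whoever owns that address.
registry = {}

def register(observed_src_ip: str, node_name: str):
    registry[observed_src_ip] = node_name

def reboot_target(node_name: str):
    """Return the IP the orchestrator would SSH to for this node."""
    for ip, name in registry.items():
        if name == node_name:
            return ip
    return None

# Alpha-Centauri registers from its own address...
register('10.42.0.199', 'alpha-centauri')
# ...then drone-Tarn's NAT'd traffic arrives from the same address.
register('10.42.0.199', 'drone-tarn')
```

After the second registration, rebooting "drone-tarn" resolves to 10.42.0.199, and the gateway's own entry is gone entirely.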

The Fix

Same pattern as drone-Meridian — configure it to report its Tailscale IP:

GATEWAY_URL="http://10.42.0.199:8090"
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.91"  # The critical one

Now when the orchestrator wants to reboot drone-Tarn, it SSHs to 100.64.0.91 — the drone's actual Tailscale IP — not the gateway.

14 cores back in the pool, and the gateway stopped getting randomly rebooted.


Final Swarm Status

Gateway: 10.42.0.199:8090 - Online
Orchestrator: 10.42.0.201:8080 - Online

Drones (58 total cores):
  drone-Meridian  | 100.64.0.110 | 20 cores | Online
  drone-Izar      | 10.42.0.203  | 16 cores | Online
  drone-Tarn      | 100.64.0.91  | 14 cores | Online
  drone-Tau-Beta  | 10.42.0.194  |  8 cores | Online

Cross-Network Drone Configuration (Reference)

For any drone not on the gateway's local network:

Variable        | Requirement                  | Why
GATEWAY_URL     | Can use local IP             | Tailscale subnet routing handles it
ORCHESTRATOR_IP | MUST be Tailscale IP         | Avoids NAT masquerade issue
UPLOAD_HOST     | MUST be Tailscale IP         | For binary uploads
REPORT_IP       | MUST be drone's Tailscale IP | So orchestrator can reach it for SSH

The REPORT_IP is the critical one. That's what the orchestrator uses for SSH commands. Get it wrong, and you reboot the wrong server.
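One cheap guard is to validate REPORT_IP at registration time: Tailscale assigns addresses from the CGNAT range 100.64.0.0/10, so for a cross-network drone anything outside that range is suspect. A sketch using the standard library's ipaddress module (the function name is mine):

```python
import ipaddress

# Tailscale allocates node addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network('100.64.0.0/10')

def is_tailscale_ip(addr: str) -> bool:
    """True if addr falls inside Tailscale's 100.64.0.0/10 range."""
    try:
        return ipaddress.ip_address(addr) in TAILSCALE_RANGE
    except ValueError:
        return False  # not a parseable IP at all
```

A gateway could reject (or at least log loudly about) any remote drone whose REPORT_IP fails this check, instead of discovering the problem via a rebooted server.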

Tailscale IPs Reference (Build Swarm)

Host              | Local IP       | Tailscale IP  | Role
Alpha-Centauri    | 10.42.0.199    | 100.64.0.88   | Gateway
Izar-Orchestrator | 10.42.0.201    | 100.64.0.18   | Orchestrator
drone-Izar        | 10.42.0.203    | 100.64.0.126  | Drone
drone-Tarn        | 192.168.20.196 | 100.64.0.91   | Drone
drone-Tau-Beta    | 10.42.0.194    | 100.64.0.125  | Drone
drone-Meridian    | 192.168.20.77  | 100.64.0.110  | Drone

Prevention

  1. Protected hosts list: IPs that should never receive reboot commands
  2. Enable services on boot: Don't rely on manual starts after reboot
  3. Document Tailscale mappings: Keep a reference of which IP is which
  4. Remove debug blocks: Temporary workarounds shouldn't be permanent
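Item 1 can be a few lines in the orchestrator's auto-heal path. A sketch under my own naming (`PROTECTED_HOSTS` and `maybe_reboot` are hypothetical, not the orchestrator's actual code):

```python
# Hypothetical guard in the auto-heal path: infrastructure IPs that must
# never receive a reboot, no matter what the failure counters say.
PROTECTED_HOSTS = {
    '10.42.0.199',  # Alpha-Centauri gateway (local)
    '100.64.0.88',  # Alpha-Centauri gateway (Tailscale)
    '10.42.0.201',  # Izar-Orchestrator (local)
    '100.64.0.18',  # Izar-Orchestrator (Tailscale)
}

def maybe_reboot(target_ip: str, issue_reboot) -> bool:
    """Reboot target_ip via issue_reboot(ip) unless it is protected."""
    if target_ip in PROTECTED_HOSTS:
        return False  # refuse: this is infrastructure, not a drone
    issue_reboot(target_ip)
    return True
```

Even if another NAT surprise maps a drone onto the gateway's address, the worst case becomes a refused reboot rather than a downed gateway.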

Lessons Learned

  1. NAT masquerade hides real IPs — if your distributed system uses IP addresses for identity, NAT will betray you
  2. REPORT_IP is critical — the IP a node reports should be reachable for management commands
  3. Auto-reboot features need the right target — make absolutely sure it's rebooting the right server
  4. Past Me leaves landmines — that "temporary block" was five days old and I forgot it existed
  5. Clean shutdowns have a cause — if there's no crash, something initiated the shutdown. Check auth logs

The swarm is back to 58 cores. The gateway stopped getting randomly rebooted. And I added a protected hosts list so this can never happen again.