The Drone That Rebooted the Wrong Server
Date: 2026-01-27
Issue: Gateway rebooting 4+ times per day; multiple drones offline
Root Cause: NAT masquerade + auto-heal = friendly fire
Result: Swarm restored to 58 cores across 4 active drones
The Mystery
Alpha-Centauri (the gateway server at 10.42.0.199) was rebooting randomly. Four times on January 26th alone. No pattern I could find.
- No kernel panics
- No OOM kills
- No hardware errors
- NVMe temperature normal (46°C)
Every reboot was a clean shutdown. Like someone typed `reboot`.
I checked the auth logs:
Jan 26 14:23:47 Alpha-Centauri sshd: Accepted publickey for root from 100.64.0.18
Jan 26 14:23:49 Alpha-Centauri systemd: Stopping all remaining mounts...
Someone SSH'd in from 100.64.0.18 — which is Izar-Orchestrator — and the system shut down two seconds later.
The orchestrator was rebooting my gateway. But why?
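That correlation is easy to pull out mechanically: grep for the shutdown marker with one line of leading context, and the preceding SSH accept falls out. A minimal sketch against an inline two-line sample (in practice, point it at the gateway's auth log or `journalctl` output instead):

```shell
# Sample log lines copied from the incident above; in practice read
# /var/log/auth.log or `journalctl` instead of an inline variable.
log='Jan 26 14:23:47 Alpha-Centauri sshd: Accepted publickey for root from 100.64.0.18
Jan 26 14:23:49 Alpha-Centauri systemd: Stopping all remaining mounts...'

# -B1 prints the line before each match, revealing who logged in
# right before the shutdown started.
printf '%s\n' "$log" | grep -B1 'Stopping all remaining mounts'
```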
The Full Damage Assessment
After Alpha-Centauri came back online, the swarm was in rough shape:
| Drone | Status | Problem |
|---|---|---|
| drone-Tau-Beta (10.42.0.194) | Offline | Service stopped |
| drone-Meridian (192.168.20.77) | Blocked | Hardcoded block in gateway code |
| drone-Tarn (192.168.20.196) | Online but dangerous | See below |
| drone-Izar (10.42.0.203) | Online | Working fine |
The gateway itself wasn't starting on boot — `systemctl enable swarm-gateway` had never been run.
Fix #1: Gateway Service Persistence
ssh root@10.42.0.199 'systemctl start swarm-gateway && systemctl enable swarm-gateway'
Now it survives reboots. Novel concept.
Fix #2: drone-Tau-Beta (The Easy One)
Just stopped. SSH'd in, started it:
ssh root@10.42.0.194 'rc-service swarm-drone start'
8 cores back in the pool.
Fix #3: drone-Meridian (The Blocked One)
This drone lives on the remote network (192.168.20.x) and had been causing trouble previously. Past Me had helpfully added a hardcoded block:
# Temporary Block: drone-Meridian (failed uploads, unreachable SSH)
if 'meridian' in node_data.get('name', '').lower():
    log.warning(f"Registration rejected for BLOCKED node {node_data.get('name')}")
    return {'error': 'Node blocked pending maintenance'}
I forgot to remove it after fixing the actual issue. Classic.
Removed the block, then configured the drone to use Tailscale IPs for everything:
GATEWAY_URL="http://100.64.0.88:8090"
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.110"
Also had to patch the drone code to actually use these variables:
# Changed from:
orch_config = {'ip': None, 'port': 8080}
# To:
orch_config = {'ip': os.environ.get('ORCHESTRATOR_IP'), 'port': int(os.environ.get('ORCHESTRATOR_PORT', 8080))}
20 cores back in the pool.
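The shape of that patch is "read the endpoint from the environment, and fail loudly if the critical variable is missing". A hedged sketch (the function name and error handling are mine; only `ORCHESTRATOR_IP` and `ORCHESTRATOR_PORT` come from the config above):

```python
import os

def load_orch_config():
    """Build the orchestrator endpoint from environment variables.

    ORCHESTRATOR_IP gets no default on purpose: a cross-network drone
    must set it to a Tailscale IP, so failing loudly beats silently
    falling back to an unroutable local address.
    """
    ip = os.environ.get('ORCHESTRATOR_IP')
    if not ip:
        raise RuntimeError('ORCHESTRATOR_IP is not set')
    return {'ip': ip, 'port': int(os.environ.get('ORCHESTRATOR_PORT', 8080))}
```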
Fix #4: drone-Tarn (The Mystery Rebooter)
Now for the fun one.
The build swarm has an auto-reboot feature: when a drone fails too many builds in a row, the orchestrator SSHs to it and runs reboot. Clean slate, start fresh.
Here's the bug: drone-Tarn (192.168.20.196) was configured to reach the gateway at http://10.42.0.199:8090. That IP isn't directly routable from the remote network. Traffic goes through Tailscale subnet routing, which means it gets NAT'd (masqueraded) through Alpha-Centauri.
The orchestrator sees the source IP of drone-Tarn's traffic as... 10.42.0.199. The gateway's IP. Because that's where the NAT is happening.
When drone-Tarn failed builds and the orchestrator tried to reboot it:
# What the orchestrator thought it was doing:
ssh root@10.42.0.199 'reboot' # Reboot drone-Tarn
# What it actually did:
ssh root@10.42.0.199 'reboot' # Reboot the gateway
Same IP. Different intended targets.
The gateway was getting rebooted because a drone on another network was failing builds.
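The failure generalizes to any system that uses observed source IPs as identity. A toy sketch (hypothetical registry, not the orchestrator's actual code) showing how NAT makes two machines collide on one key:

```python
# Toy registry keyed by observed source IP. Behind NAT, two different
# machines (the gateway and a masqueraded drone) present the same
# source address, so the second registration silently overwrites the
# first -- and a reboot aimed at the drone lands on the gateway.
registry = {}

def register(observed_src_ip, hostname):
    registry[observed_src_ip] = hostname

register('10.42.0.199', 'Alpha-Centauri')  # gateway registers itself
register('10.42.0.199', 'drone-Tarn')      # NAT'd drone: same source IP!

def reboot_target(hostname):
    """Resolve a hostname back to the IP we would SSH to."""
    for ip, name in registry.items():
        if name == hostname:
            return ip
    return None
```

Here `reboot_target('drone-Tarn')` resolves to 10.42.0.199 — the gateway — and the entry for Alpha-Centauri is gone entirely.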
The Fix
Same pattern as drone-Meridian — configure it to report its Tailscale IP:
GATEWAY_URL="http://10.42.0.199:8090"
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.91" # The critical one
Now when the orchestrator wants to reboot drone-Tarn, it SSHs to 100.64.0.91 — the drone's actual Tailscale IP — not the gateway.
14 cores back in the pool, and the gateway stopped getting randomly rebooted.
Final Swarm Status
Gateway: 10.42.0.199:8090 - Online
Orchestrator: 10.42.0.201:8080 - Online
Drones (58 total cores):

| Drone | IP | Cores | Status |
|---|---|---|---|
| drone-Meridian | 100.64.0.110 | 20 | Online |
| drone-Izar | 10.42.0.203 | 16 | Online |
| drone-Tarn | 100.64.0.91 | 14 | Online |
| drone-Tau-Beta | 10.42.0.194 | 8 | Online |
Cross-Network Drone Configuration (Reference)
For any drone not on the gateway's local network:
| Variable | Requirement | Why |
|---|---|---|
| GATEWAY_URL | Can use local IP | Tailscale subnet routing handles it |
| ORCHESTRATOR_IP | MUST be Tailscale IP | Avoids NAT masquerade issue |
| UPLOAD_HOST | MUST be Tailscale IP | For binary uploads |
| REPORT_IP | MUST be drone's Tailscale IP | So orchestrator can reach it for SSH |
The REPORT_IP is the critical one. That's what the orchestrator uses for SSH commands. Get it wrong, and you reboot the wrong server.
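A cheap sanity check at registration time is to reject a cross-network REPORT_IP that isn't in Tailscale's CGNAT range (100.64.0.0/10). A sketch using only the standard library (the function name is mine):

```python
import ipaddress

# Tailscale assigns node addresses out of the CGNAT range.
TAILSCALE_RANGE = ipaddress.ip_network('100.64.0.0/10')

def report_ip_ok(report_ip: str) -> bool:
    """True if the IP a remote drone reports is a Tailscale address."""
    try:
        return ipaddress.ip_address(report_ip) in TAILSCALE_RANGE
    except ValueError:  # not a parseable IP at all
        return False
```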
Tailscale IPs Reference (Build Swarm)
| Host | Local IP | Tailscale IP | Role |
|---|---|---|---|
| Alpha-Centauri | 10.42.0.199 | 100.64.0.88 | Gateway |
| Izar-Orchestrator | 10.42.0.201 | 100.64.0.18 | Orchestrator |
| drone-Izar | 10.42.0.203 | 100.64.0.126 | Drone |
| drone-Tarn | 192.168.20.196 | 100.64.0.91 | Drone |
| drone-Tau-Beta | 10.42.0.194 | 100.64.0.125 | Drone |
| drone-Meridian | 192.168.20.77 | 100.64.0.110 | Drone |
Prevention
- Protected hosts list: IPs that should never receive reboot commands
- Enable services on boot: Don't rely on manual starts after reboot
- Document Tailscale mappings: Keep a reference of which IP is which
- Remove debug blocks: Temporary workarounds shouldn't be permanent
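The first item amounts to a few lines in the orchestrator's reboot path. A hedged sketch with hypothetical names (`safe_reboot`, `ssh_run`); the protected IPs come from the tables above:

```python
# Hosts that must never receive a reboot command, no matter what
# IP a failing drone appears to report from.
PROTECTED_HOSTS = {
    '10.42.0.199', '100.64.0.88',  # Alpha-Centauri (gateway)
    '10.42.0.201', '100.64.0.18',  # Izar-Orchestrator
}

def safe_reboot(ip: str) -> bool:
    """Refuse to reboot infrastructure hosts; return True if the
    reboot was allowed to proceed."""
    if ip in PROTECTED_HOSTS:
        # log.error(f"Refusing to reboot protected host {ip}")
        return False
    # ssh_run(ip, 'reboot')  # hypothetical SSH helper
    return True
```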
Lessons Learned
- NAT masquerade hides real IPs — if your distributed system uses IP addresses for identity, NAT will betray you
- REPORT_IP is critical — the IP a node reports should be reachable for management commands
- Auto-reboot features need the right target — make absolutely sure it's rebooting the right server
- Past Me leaves landmines — that "temporary block" was five days old and I forgot it existed
- Clean shutdowns have a cause — if there's no crash, something initiated the shutdown. Check auth logs
The swarm is back to 58 cores. The gateway stopped getting randomly rebooted. And I added a protected hosts list so this can never happen again.