
The Drone That Rebooted the Wrong Server


Date: 2026-01-27
Issue: Gateway rebooting 4+ times per day; multiple drones offline
Root Cause: NAT masquerade + auto-heal = friendly fire
Result: Swarm restored to 58 cores across 4 active drones


The Mystery

Alpha-Centauri (the gateway server at 10.42.0.199) was rebooting randomly. Four times on January 26th alone. No pattern I could find.

  • No kernel panics
  • No OOM kills
  • No hardware errors
  • NVMe temperature normal (46°C)

Every reboot was a clean shutdown. Like someone typed reboot.

I checked the auth logs:

Jan 26 14:23:47 Alpha-Centauri sshd: Accepted publickey for root from 100.64.0.18
Jan 26 14:23:49 Alpha-Centauri systemd: Stopping all remaining mounts...

Someone SSH'd in from 100.64.0.18 — which is Izar-Orchestrator — and the system shut down two seconds later.

The orchestrator was rebooting my gateway. But why?


The Full Damage Assessment

After Alpha-Centauri came back online, the swarm was in rough shape:

Drone                          | Status               | Problem
drone-Tau-Beta (10.42.0.194)   | Offline              | Service stopped
drone-Meridian (192.168.20.77) | Blocked              | Hardcoded block in gateway code
drone-Tarn (192.168.20.196)    | Online but dangerous | See below
drone-Izar (10.42.0.203)       | Online               | Working fine

The gateway itself wasn't starting on boot — systemctl enable swarm-gateway had never been run.


Fix #1: Gateway Service Persistence

ssh root@10.42.0.199 'systemctl start swarm-gateway && systemctl enable swarm-gateway'

Now it survives reboots. Novel concept.

Fix #2: drone-Tau-Beta (The Easy One)

Just stopped. SSH'd in, started it:

ssh root@10.42.0.194 'rc-service swarm-drone start'

8 cores back in the pool.

Fix #3: drone-Meridian (The Blocked One)

This drone lives on the remote network (192.168.20.x) and had been causing trouble previously. Past Me had helpfully added a hardcoded block:

# Temporary Block: drone-Meridian (failed uploads, unreachable SSH)
if 'meridian' in node_data.get('name', '').lower():
    log.warning(f"Registration rejected for BLOCKED node")
    return {'error': 'Node blocked pending maintenance'}

I forgot to remove it after fixing the actual issue. Classic.
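One way to keep a workaround like that from outliving its purpose is to give the block an expiry date, so it removes itself even when I forget. A minimal sketch, assuming a structure like the gateway's name check above (`BLOCKED_NODES` and `check_block` are hypothetical names, not the gateway's actual API):

```python
from datetime import datetime, timedelta

# Hypothetical replacement for the hardcoded name check: each entry
# carries a reason and an expiry, so stale blocks disappear on their own.
BLOCKED_NODES = {
    'meridian': {
        'reason': 'failed uploads, unreachable SSH',
        'expires': datetime(2026, 1, 25),  # block dies even if I forget it
    },
}

def check_block(node_name: str):
    """Return the active block entry for a node, or None if unblocked."""
    entry = BLOCKED_NODES.get(node_name.lower())
    if entry is None:
        return None
    if datetime.now() >= entry['expires']:
        return None  # expired: treat the node as unblocked
    return entry
```

The registration handler would then reject only when `check_block()` returns a live entry, instead of matching on a name forever.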

Removed the block, then configured the drone to use Tailscale IPs for everything:

GATEWAY_URL="http://100.64.0.88:8090"
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.110"

Also had to patch the drone code to actually use these variables:

# Changed from:
orch_config = {'ip': None, 'port': 8080}

# To (needs `import os` at the top of the module):
orch_config = {
    'ip': os.environ.get('ORCHESTRATOR_IP'),
    'port': int(os.environ.get('ORCHESTRATOR_PORT', 8080)),
}
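Since the silent failure mode here was `ip: None`, it may be worth failing fast when the variable is missing instead of registering with a bogus address. A sketch of that idea (`load_orch_config` is my name for it, not the drone's real code):

```python
import os

def load_orch_config(env=os.environ):
    """Build the orchestrator config, refusing to start with ip=None."""
    ip = env.get('ORCHESTRATOR_IP')
    if not ip:
        # Better to die loudly at startup than come up half-configured.
        raise RuntimeError('ORCHESTRATOR_IP is not set; refusing to start')
    return {'ip': ip, 'port': int(env.get('ORCHESTRATOR_PORT', 8080))}
```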

20 cores back in the pool.


Fix #4: drone-Tarn (The Mystery Rebooter)

Now for the fun one.

The build swarm has an auto-reboot feature: when a drone fails too many builds in a row, the orchestrator SSHs to it and runs reboot. Clean slate, start fresh.

Here's the bug: drone-Tarn (192.168.20.196) was configured to reach the gateway at http://10.42.0.199:8090. That IP isn't directly routable from the remote network. Traffic goes through Tailscale subnet routing, which means it gets NAT'd (masqueraded) through Alpha-Centauri.

The orchestrator sees the source IP of drone-Tarn's traffic as... 10.42.0.199. The gateway's IP. Because that's where the NAT is happening.

When drone-Tarn failed builds and the orchestrator tried to reboot it:

# What the orchestrator thought it was doing:
ssh root@10.42.0.199 'reboot'  # Reboot drone-Tarn

# What it actually did:
ssh root@10.42.0.199 'reboot'  # Reboot the gateway

Same IP. Different intended targets.

The gateway was getting rebooted because a drone on another network was failing builds.
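The failure mode is easy to reproduce in miniature: if the orchestrator keys nodes by the source IP it observes, every NAT'd drone collapses onto the masquerade address. A toy model of that logic (not the orchestrator's real code, just the shape of the bug):

```python
# Toy model: the orchestrator records one entry per observed source IP.
# Behind NAT, a remote drone "arrives" from the masquerade address, so
# its registration silently overwrites whoever owns that address.
registry = {}

def register(observed_src_ip: str, node_name: str):
    registry[observed_src_ip] = node_name

def reboot_target(node_name: str):
    """Return the IP the orchestrator would SSH to for this node."""
    for ip, name in registry.items():
        if name == node_name:
            return ip
    return None

# Alpha-Centauri registers from its own address...
register('10.42.0.199', 'alpha-centauri')
# ...then drone-Tarn's NAT'd traffic arrives from the same address.
register('10.42.0.199', 'drone-tarn')
```

After the second registration, rebooting "drone-tarn" resolves to 10.42.0.199, and the gateway's own entry is gone entirely.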

The Fix

Same pattern as drone-Meridian — configure it to report its Tailscale IP:

GATEWAY_URL="http://10.42.0.199:8090"
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.91"  # The critical one

Now when the orchestrator wants to reboot drone-Tarn, it SSHs to 100.64.0.91 — the drone's actual Tailscale IP — not the gateway.

14 cores back in the pool, and the gateway stopped getting randomly rebooted.


Final Swarm Status

Gateway: 10.42.0.199:8090 - Online
Orchestrator: 10.42.0.201:8080 - Online

Drones (58 total cores):
  drone-Meridian  | 100.64.0.110 | 20 cores | Online
  drone-Izar      | 10.42.0.203  | 16 cores | Online
  drone-Tarn      | 100.64.0.91  | 14 cores | Online
  drone-Tau-Beta  | 10.42.0.194  |  8 cores | Online

Cross-Network Drone Configuration (Reference)

For any drone not on the gateway's local network:

Variable        | Requirement                  | Why
GATEWAY_URL     | Can use local IP             | Tailscale subnet routing handles it
ORCHESTRATOR_IP | MUST be Tailscale IP         | Avoids NAT masquerade issue
UPLOAD_HOST     | MUST be Tailscale IP         | For binary uploads
REPORT_IP       | MUST be drone's Tailscale IP | So orchestrator can reach it for SSH

The REPORT_IP is the critical one. That's what the orchestrator uses for SSH commands. Get it wrong, and you reboot the wrong server.
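One cheap guard is to validate REPORT_IP at registration time: Tailscale assigns addresses from the CGNAT range 100.64.0.0/10, so for a cross-network drone anything outside that range is suspect. A sketch using the standard library's ipaddress module (the function name is mine):

```python
import ipaddress

# Tailscale allocates node addresses from the CGNAT block 100.64.0.0/10.
TAILSCALE_RANGE = ipaddress.ip_network('100.64.0.0/10')

def is_tailscale_ip(addr: str) -> bool:
    """True if addr falls inside Tailscale's 100.64.0.0/10 range."""
    try:
        return ipaddress.ip_address(addr) in TAILSCALE_RANGE
    except ValueError:
        return False  # not a parseable IP at all
```

A gateway could reject (or at least log loudly about) any remote drone whose REPORT_IP fails this check, instead of discovering the problem via a rebooted server.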

Tailscale IPs Reference (Build Swarm)

Host              | Local IP       | Tailscale IP  | Role
Alpha-Centauri    | 10.42.0.199    | 100.64.0.88   | Gateway
Izar-Orchestrator | 10.42.0.201    | 100.64.0.18   | Orchestrator
drone-Izar        | 10.42.0.203    | 100.64.0.126  | Drone
drone-Tarn        | 192.168.20.196 | 100.64.0.91   | Drone
drone-Tau-Beta    | 10.42.0.194    | 100.64.0.125  | Drone
drone-Meridian    | 192.168.20.77  | 100.64.0.110  | Drone

Prevention

  1. Protected hosts list: IPs that should never receive reboot commands
  2. Enable services on boot: Don't rely on manual starts after reboot
  3. Document Tailscale mappings: Keep a reference of which IP is which
  4. Remove debug blocks: Temporary workarounds shouldn't be permanent
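Item 1 can be a few lines in the orchestrator's auto-heal path. A sketch under my own naming (`PROTECTED_HOSTS` and `maybe_reboot` are hypothetical, not the orchestrator's actual code):

```python
# Hypothetical guard in the auto-heal path: infrastructure IPs that must
# never receive a reboot, no matter what the failure counters say.
PROTECTED_HOSTS = {
    '10.42.0.199',  # Alpha-Centauri gateway (local)
    '100.64.0.88',  # Alpha-Centauri gateway (Tailscale)
    '10.42.0.201',  # Izar-Orchestrator (local)
    '100.64.0.18',  # Izar-Orchestrator (Tailscale)
}

def maybe_reboot(target_ip: str, issue_reboot) -> bool:
    """Reboot target_ip via issue_reboot(ip) unless it is protected."""
    if target_ip in PROTECTED_HOSTS:
        return False  # refuse: this is infrastructure, not a drone
    issue_reboot(target_ip)
    return True
```

Even if another NAT surprise maps a drone onto the gateway's address, the worst case becomes a refused reboot rather than a downed gateway.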

Lessons Learned

  1. NAT masquerade hides real IPs — if your distributed system uses IP addresses for identity, NAT will betray you
  2. REPORT_IP is critical — the IP a node reports should be reachable for management commands
  3. Auto-reboot features need the right target — make absolutely sure it's rebooting the right server
  4. Past Me leaves landmines — that "temporary block" was five days old and I forgot it existed
  5. Clean shutdowns have a cause — if there's no crash, something initiated the shutdown. Check auth logs

The swarm is back to 58 cores. The gateway stopped getting randomly rebooted. And I added a protected hosts list so this can never happen again.