The Command Center Saga: 10 Days of Homelab Chaos, Recovery, and Finally Breathing

Ten days. Storage archaeology. Dashboard building. A power crisis. A log flood that almost ate a RAM disk. A network ghost caused by a bridge interface. Teaching dad to use Proxmox over the phone during a football game. And then the week after — auditing everything, writing runbooks, and finally being able to sleep through power outages.

This is the full story of building proper monitoring infrastructure — and immediately needing it.


Day 1: The Great Storage Cleanup

Remote monitoring showed Meridian-Mako-Silo (the Unraid server at the Andromeda site) was running low on space. 62TB array, supposedly plenty of room. Something was wrong.

SSH'd in through the Proxmox jump box and started exploring. Found the issue within minutes:

/mnt/user/media/movies/New Folder/New Folder/New Folder/...

Forty. Levels. Deep.

Each level contained partial copies of the same files. The array wasn't running out of space — it was drowning in accidental duplicates.

The Archaeology

Spent two hours tracing the damage:

  • Movies: 3 complete copies spread across nested folders
  • TV Shows: 2 copies plus a third partial
  • Photos: Thankfully untouched
  • Backups: One folder that had been renamed 7 times

Timeline reconstruction suggests this happened over multiple "reorganizing" sessions. Each one created a new folder, moved things, then got interrupted. Classic.

The Cleanup

Wrote a script to identify duplicates by checksum:

# md5 hashes are 32 hex chars, so -w 32 compares on the checksum only.
# -D (not -d) prints every member of each duplicate group, so no copy is missed.
find /mnt/user/media -type f -exec md5sum {} \; | \
  sort | uniq -D -w 32 > duplicates.txt

Result: 4.2TB of duplicate files.

After careful verification (and a backup snapshot), started the cleanup. Flattened folder structure. Removed verified duplicates. Created .no-touchy marker files in critical directories.

Called dad. Explained the situation diplomatically.

"I didn't do anything, the folders just... appeared."

Sure, dad. The folders spontaneously generated themselves.

Set up proper folder permissions: read-only for browsing, write access only through proper interfaces.
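On a plain Linux box the same lockdown is a couple of find passes. A minimal sketch, using a throwaway directory instead of the live share (paths and the demo tree are stand-ins, not the actual Unraid layout):

```shell
# Read-only browsing for everyone; writes stay with the owning account.
# MEDIA_ROOT is a demo tree, not the real /mnt/user/media share.
MEDIA_ROOT=$(mktemp -d)
mkdir -p "$MEDIA_ROOT/movies"
touch "$MEDIA_ROOT/movies/film.mkv"

find "$MEDIA_ROOT" -type d -exec chmod 755 {} +   # dirs: owner rwx, others r-x
find "$MEDIA_ROOT" -type f -exec chmod 644 {} +   # files: owner rw, others r
touch "$MEDIA_ROOT/.no-touchy"                    # marker: layout is intentional
```

On Unraid itself the share-level permissions in the GUI are the cleaner tool; a chmod pass like this is just the belt-and-braces layer underneath.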

Result: Recovered 4.2TB. Array now at 45% instead of 89%.


Day 2: Building the Command Center

Because clicking through 15 different UIs was getting old.

We had dashboards everywhere:

  • Proxmox web UI on Izar-Orchestrator
  • Proxmox web UI on Tarn-Host
  • Unraid dashboard on Meridian-Mako
  • Synology DSM on Cassiel-Silo
  • Individual container UIs...
  • A terminal running htop
  • A glances web UI on another monitor

I was drowning in tabs.

With infrastructure spanning the local network (10.42.0.x) and a remote network (192.168.20.x), I needed a Single Pane of Glass. A real dashboard.

Enter Homepage.

The Stack

  • Dashboard: Homepage running in Docker on Altair-Link
  • Monitoring: Glances on each host for system stats
  • Status: Uptime Kuma for service health

The Glances Version Wrinkle

This one got me. Proxmox hosts run Glances v3 (from Debian repos). Everything else runs v4. Homepage supports both, but you have to tell it which one:

- Old Proxmox Host:
    widget:
        type: glances
        url: http://10.42.0.200:61208
        version: 3  # <--- The magic number

- New Proxmox Host:
    widget:
        type: glances
        url: http://192.168.20.100:61208
        version: 4

Spent 30 minutes debugging "why isn't Izar-Orchestrator showing up" before realizing the version mismatch. Homepage fails silently on version conflicts. No error. Just... no data.

Cross-Site Monitoring

The magic: Tailscale mesh VPN.

Altair-Link (Milky Way network) can see Tarn-Host (Andromeda network) with ~40ms latency. The dashboard pulls stats from both sites as if they were local.

Visualizing the Build Cluster

The Gentoo build cluster is the heart of the system. Workers scattered across both networks — LXC containers on both Proxmox hosts, Docker containers on Unraid, a dedicated test VM.

I wanted to see them all side-by-side. Initially dumped them in a list, but the widgets were huge. Switched to a matrix layout:

layout:
  Build Workers:
    style: row
    columns: 4

Now I can see CPU load across the entire cluster in one glance. When a compile job hits, the whole row lights up red. Beautiful.

Fixing Synology Sensors

One last annoyance: the Synology NAS units weren't showing disk I/O in the Glances widget. Glances was reporting the filesystem as /volume1 and the RAID device as md2, but Homepage's disk widget wanted a physical disk identifier.

- Synology NAS:
    widget:
        type: glances
        url: http://192.168.20.7:61208
        disk: sda  # Not md2 or /volume1
        version: 4

Suddenly, the read/write graphs came alive.

Service Categories

Build Infrastructure — Gateway status, orchestrator status, drone fleet health

Hypervisors — Tarn-Host (Proxmox), Izar-Orchestrator (Proxmox)

Storage — Meridian-Mako-Silo (Unraid), Cassiel-Silo (Synology)

Media — Plex transcoder status, Tautulli activity

One URL: http://10.42.0.199:3001. All systems. Both sites. Real-time stats.

Homepage Tips

Version conflicts: Always check which Glances version your hosts run. Set it explicitly.

Cross-network routing: If your dashboard container can't reach a host, it fails silently. Test with curl from inside the container first.

Layout density: Use columns: N and style: row to pack more widgets horizontally.

Disk sensors: Synology and Linux software RAID expose logical devices (md2, /volume1) rather than the physical disk Homepage's widget expects. Check glances --stdout diskio on the host to find the right identifier.


Day 2.5: The Power Crisis

15:42 — Power fluctuation at the Andromeda site. Not a full outage — just enough to trip the UPS and cause everything to reboot.

Impact

Down:

  • Tarn-Host (Proxmox hypervisor)
  • All VMs on Tarn-Host (including drone-Tarn, orchestrator-standby)
  • Meridian-Mako-Silo (Unraid)
  • Cassiel-Silo (Synology)
  • Media services (Plex, etc.)

Unaffected:

  • Milky Way site (local network)
  • Build swarm (degraded but functional)
  • Primary workstation

Recovery Timeline

15:42 — Alert from Uptime Kuma: "Tarn-Host: DOWN"

15:43 — Build swarm self-healing kicks in. Detects drone-Tarn offline. Reclaims 8 delegated jobs. Redirects work to remaining drones.

15:45 — Attempted SSH to Andromeda hosts: timeout

15:50 — Called dad: "Did the power flicker?" / "Yeah, just for a second"

16:05 — Tarn-Host back online (auto-start after power restore). 4 VMs starting simultaneously.

16:12 — All services restored. Total outage: 30 minutes.

What Worked

  1. Uptime Kuma alerts — Knew immediately something was wrong
  2. Build swarm self-healing — No manual intervention needed for builds
  3. Proxmox auto-start — VMs came back on their own
  4. That new dashboard — Could see the recovery in real-time

What Didn't

  1. No automated notification to dad — Had to call manually
  2. VM boot order not optimized — Resource contention during simultaneous start
  3. No remote power control — Couldn't force restart anything remotely

Day 3: The Log Flood

The morning after the power crisis. Monitoring showed Meridian-Mako-Silo's log partition at 94% capacity.

For Unraid, where /var/log lives in RAM, this is deadly. Hit 100% and services crash, the GUI dies, and you're looking at a hard reboot.

The Detective Work

SSH'd in through the Proxmox jump box and ran du -sh /var/log/*. Nothing huge showed up.

That's when I remembered ghost files.

lsof +L1 /var/log

There it was. rsyslogd was holding a deleted file handle. It had grown to 112MB — massive for a RAM disk.

But why?

I dug into the active logs and found a scream of errors:

emhttpd: error: getxattr on /mnt/user/synology_mount: Operation not supported

The Synology NAS units are mounted via rclone into the main /mnt/user tree. Unraid's management daemon (emhttpd) constantly scans /mnt/user. It tries to read extended attributes (xattrs). Rclone mounts don't support xattrs.

Result: 172,800 errors per day. Every single one logged. Every single one filling the RAM disk.

The Fix

First, stop the bleeding:

kill -HUP $(cat /var/run/rsyslogd.pid)

This forced rsyslog to release the deleted file handle. Usage dropped from 94% to 6%.

Second, apply the filter. I couldn't move the mount points without breaking containers, so I told rsyslog to ignore the noise:

# /boot/config/rsyslog.conf
if $programname == "emhttpd" and $msg contains "synology" then stop

Silence is golden.

Lesson learned: Ghost logs are real. Always check lsof +L1 when disk space vanishes but du shows nothing.
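The mechanism is easy to reproduce on any Linux box: hold a file open, delete it, and the kernel keeps the bytes alive until the handle closes. A minimal demo:

```shell
# A "ghost" file: deleted from the directory tree, still consuming space.
ghost=$(mktemp)
exec 3>"$ghost"          # hold an open write handle on fd 3
echo "still logging" >&3
rm "$ghost"              # du and ls no longer see it...
readlink /proc/$$/fd/3   # ...but the kernel reports "<path> (deleted)"
exec 3>&-                # only closing the handle frees the space
```

lsof +L1 is essentially a friendlier interface over the same /proc/<pid>/fd data.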


Day 4: The Network Ghost

Day after the log flood fix. Meridian-Mako-Silo was online according to Proxmox, but:

  • Web UI: unreachable
  • SSH: connection refused
  • Plex: offline
  • NFS shares: timeout

The server was running but invisible to the network.

Remote Diagnosis

Couldn't SSH directly, but could access via Proxmox console passthrough:

  1. Logged into Tarn-Host Proxmox
  2. Opened Unraid VM console
  3. Found the server sitting at a login prompt, looking innocent

Logged in locally. Network interfaces were... interesting.

ip addr show
# eth0: NO-CARRIER
# br0: state DOWN

The bridge interface had decided not to start properly.

Root Cause

During the power event, the VM started before the virtual network switch was fully initialized. Unraid grabbed an interface that didn't exist yet, and never retried.

The Phone Call

Had to guide dad through opening the Proxmox console. Over the phone. While he was watching a football game.

"Click on the VM name. No, the name. The text. Yes. Now click Console. The button that says Console. No, not that one..."

We got there eventually.

Prevention

Added startup delay to the Unraid VM in Proxmox:

Start/Shutdown order: 3
Startup delay: 60
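The same settings can be applied from the Proxmox shell with qm set; the VMID below is a placeholder, not the actual Unraid VM's ID:

```shell
# order= sets the boot sequence; up= waits N seconds after this VM starts
# before Proxmox boots the next one, spreading out the startup I/O storm.
if command -v qm >/dev/null 2>&1; then
    qm set 100 --startup "order=3,up=60"   # 100 is a placeholder VMID
else
    echo "qm not found (not a Proxmox host)"
fi
```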

Also configured Unraid to be more aggressive about network recovery in the startup script.


Day 5: The Thunderbolt Incident

Because things were going too smoothly.

I wanted faster transfer speeds for the Unraid box. I had a spare Thunderbolt 10GbE adapter. I plugged it in.

IMMEDIATELY, the server went dark.

  • No Web UI
  • No SSH
  • No Docker services
  • Ping worked... sometimes?

I unplugged the adapter. It didn't come back.

Safe Mode Scramble

Booted into Safe Mode (no plugins, no Docker) to get access. Something had corrupted the network stack.

It turned out to be a perfect storm of three separate pre-existing failures that the Thunderbolt adapter exposed:

1. The Boot Script

My /boot/config/go file (Unraid's startup script) had syntax errors. Missing then keywords, extra fi blocks. It had been failing silently for who knows how long. The server worked despite the script, not because of it.
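A habit that would have caught this years earlier: parse the script without executing it. bash -n does a syntax-only pass; a sketch, assuming the script lives at Unraid's usual /boot/config/go path:

```shell
# Syntax-only check; catches the missing-then / stray-fi class of errors
# without running a single command from the script.
script="${GO_SCRIPT:-/boot/config/go}"
if [ ! -f "$script" ]; then
    echo "no such script: $script"
elif bash -n "$script"; then
    echo "syntax OK: $script"
else
    echo "SYNTAX ERROR in $script, fix before rebooting"
fi
```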

2. Docker Config

I had told Docker to use wlan0 (WiFi) as a custom network at some point. The WiFi interface was down. Docker refused to start, dragging dependent services down with it.

3. The Tailscale Routing Trap

This was the kicker.

The Proxmox hypervisor acts as a subnet router, advertising 192.168.20.0/24 to the Tailscale network. When the Unraid server connected to Tailscale, it learned the route to its own subnet via Tailscale.

So when I tried to SSH from the hypervisor (192.168.20.100) to Unraid (192.168.20.50):

  1. Packet goes Hypervisor → Unraid (Local LAN) ✓
  2. Unraid replies... but thinks the fastest path to 192.168.20.0/24 is via tailscale0!
  3. Packet goes Unraid → Tailscale relay → Hypervisor
  4. Hypervisor drops it because of asymmetric routing

The Resolution

Taught Unraid to ignore its own tail:

ip rule add to 192.168.20.0/24 lookup main priority 5200

This forces traffic destined for the local LAN to use the main routing table (eth0/br0), ignoring Tailscale's magic routes.
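To confirm the rule took, read the policy table back; lower priority numbers win, and 5200 sorts ahead of the 52xx-range rules Tailscale adds on Linux. Guarded so it degrades gracefully on a box without iproute2:

```shell
# Show the policy routing table; the priority 5200 entry must appear
# above (lower number than) the rules Tailscale installs.
if command -v ip >/dev/null 2>&1; then
    ip rule show
    # Ask the kernel which path a LAN destination would actually take:
    ip route get 192.168.20.100 2>/dev/null \
        || echo "destination not routable from this host"
else
    echo "iproute2 not available"
fi
```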

Lesson: Verify your boot scripts. A script with a syntax error might not run anything after the error. No warnings. Just silent failure. And Thunderbolt on Unraid? Cursed.


Days 8-10: The Audit and Documentation Sprint

The dust had settled. Now came the boring part. The essential part.

I'd learned something from the chaos: "it's probably fine" is not a status. I needed to verify every service, manually, with my own eyes.

The Full Audit

Saturday morning. Coffee. A blank checklist.

Build Infrastructure:

Service        Status       Notes
Gateway        Responding   API returns valid JSON
Orchestrator   Processing   Queue depth: 3 packages
Drones (5/5)   Online       All reporting heartbeats
Binhost        Serving      SSH connection stable
Client sync    Working      Pulled test package successfully

The swarm had survived the power event. Self-healing kicked in exactly as designed. When the Andromeda drones came back online, they automatically rejoined the collective without intervention. That felt good.

Media Services:

Service         Status       Notes
Plex (local)    Streaming    Tested playback on three clients
Plex (remote)   Streaming    Tailscale latency: 42ms
Tautulli        Collecting   Stats updating in real-time
Overseerr       Processing   Test request fulfilled

Everything worked. But Tautulli had stale session data — ghosts of streams that ended days ago. A restart cleared it.

Storage and Network: Green across the board. All drives healthy, Tailscale mesh connected, DNS resolving, VPN stable for 48+ hours.

The Minor Issues

Not everything was perfect:

  • Tautulli session ghosts — Sessions active when Tarn-Host went down were never properly closed
  • Backup schedule drift — One job at 2 AM competed with Portage sync for disk I/O. Moved to 3 AM
  • DNS record stale — An old A record pointed to an IP that hadn't existed for months

None would have caused an outage. All would have caused confusion during the next incident.

The Documentation Sprint

Auditing revealed the real gap: not in services, but in documentation.

When I was guiding dad through Proxmox over the phone, I realized: there was no "click here, then here, then here" guide. I was improvising from memory while he was watching football.

Never again.

What I wrote:

Network Topology Diagram — Not just IPs. A full visual: both sites, subnets, hypervisors vs VMs vs containers, Tailscale subnet router paths, the exact route a packet takes from my desktop to dad's Plex server.

Service Catalog — Every running service: what it does, what depends on it, where the config lives, how to restart it, what "broken" looks like.

Recovery Runbooks — Step-by-step procedures:

  • "Site is completely down" (check power, then UPS, then network, then hypervisor)
  • "Service X isn't responding" (flowchart by service type)
  • "How to add a new service" (the right way, not the fast way)
  • "How to explain to dad what broke" (simplified terminology, no jargon)

That last one is real. When I'm debugging at 11 PM and need dad to check a blinking light, I can't use words like "hypervisor" or "LXC container." The runbook has translations.

Emergency Contact Procedures — Printed. Laminated. Taped inside dad's network cabinet:

  1. Check if it's just the internet (can you load Google?)
  2. Check if Tailscale is connected (green icon in system tray)
  3. If both are fine, the problem is probably on my end — wait
  4. If the internet is down, restart the router (unplug, wait 30 seconds, plug back in)
  5. If that doesn't work, call the ISP

Cleanup Tasks

While documenting, found cruft:

  • Orphaned containers — Three Docker containers that hadn't run in months. Removed
  • Configuration archaeology — Files named traefik.yml.backup.old.working.FINAL. Archived to a dated folder
  • Naming inconsistencies — Some configs said "titan," others said "Tarn-Host." Standardized everything
  • DNS debt — Records for services that no longer existed. Cleaned

Maintenance Automation

Tasks I was doing manually every week became scripts:

  • Check certificate expiration dates
  • Verify backups actually ran
  • Check disk space on all nodes
  • Rotate logs before they fill drives

They run via cron. They report to Uptime Kuma. If a certificate is expiring in < 14 days, I get an alert. If a backup hasn't run in > 36 hours, alert. Any disk > 85% full, alert.
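The disk-space job is the simplest of the four. A trimmed sketch; the threshold, and the push URL in the comment, are examples rather than the real monitor:

```shell
# Flag any real filesystem above the threshold; cron runs this hourly.
THRESHOLD=85
df -P -x tmpfs -x devtmpfs | awk -v t="$THRESHOLD" 'NR > 1 {
    use = $5; sub(/%/, "", use)                    # "91%" -> 91
    if (use + 0 > t) printf "ALERT %s at %s%%\n", $6, use
}'
# A non-empty result flips an Uptime Kuma push monitor, e.g.:
#   curl -fsS "https://kuma.example/api/push/<token>?status=down&msg=disk"
```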

Synthetic Health Checks

The dashboard shows if services are "up." But "up" doesn't mean "working."

A container can be running while the application inside has crashed. A web server can respond to health checks while returning 500 errors to users.

Added synthetic checks:

  • Plex: Every 5 minutes, try to play a 10-second test video. If it fails, alert.
  • Build swarm: Every 5 minutes, query the gateway API. If it returns invalid JSON or times out, alert.
  • NAS: Every 5 minutes, list a test directory. If it fails or times out, alert.

These catch the "it's running but broken" failures that status pages miss.
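The build-swarm check, sketched below; the URL is a placeholder for the real gateway endpoint. The point is to parse the response, not just count a 200:

```shell
# "Up" means: answers within 5 seconds with JSON that actually parses.
check_gateway() {
    curl -fsS --max-time 5 "$1" | python3 -m json.tool >/dev/null 2>&1
}

if check_gateway "http://10.42.0.50:8080/api/status"; then   # placeholder URL
    echo "gateway: healthy"
else
    echo "gateway: ALERT"    # feeds the Uptime Kuma alert in the real job
fi
```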


The Scorecard

Metric                       Before   After
Services documented          ~40%     100%
Recovery runbooks            0        4
Orphaned containers          3        0
Stale DNS records            7        0
Automated health checks      0        12
Laminated emergency guides   0        1

What I Learned

Monitoring saves you. Having the dashboard up during the power crisis meant I could see recovery happening in real-time instead of wondering what was broken.

Automation handles the common cases. The build swarm self-healed without intervention. That freed me to focus on the uncommon case (network ghost).

Ghost logs are real. du lies when a deleted file has an open handle. lsof +L1 is your friend.

Remote access has limits. When you need to guide someone through console access over the phone, every extra step is painful. Simplify recovery procedures.

Infrastructure isn't done when it works. It's done when someone else could run it. When future-me won't curse past-me. When the documentation matches reality. When recovery doesn't require heroics.

Dad will reorganize folders. Accept this. Implement permissions.

Ten days. From "we should have better monitoring" to "thank god we have better monitoring." The swarm is stable. The documentation is current. The runbooks are written.

And next time something breaks at halftime, dad knows exactly where to click.

System: maintainable. Finally.