The Command Center Saga: 10 Days of Homelab Chaos, Recovery, and Finally Breathing

Ten days. Storage archaeology. Dashboard building. A power crisis. A log flood that almost ate a RAM disk. A network ghost caused by a bridge interface. Teaching dad to use Proxmox over the phone during a football game. And then the week after — auditing everything, writing runbooks, and finally being able to sleep through power outages.

This is the full story of building proper monitoring infrastructure — and immediately needing it.


Day 1: The Great Storage Cleanup

Remote monitoring showed Meridian-Mako-Silo (the Unraid server at the Andromeda site) was running low on space. 62TB array, supposedly plenty of room. Something was wrong.

SSH'd in through the Proxmox jump box and started exploring. Found the issue within minutes:

/mnt/user/media/movies/New Folder/New Folder/New Folder/...

Forty. Levels. Deep.

Each level contained partial copies of the same files. The array wasn't running out of space — it was drowning in accidental duplicates.

The Archaeology

Spent two hours tracing the damage:

  • Movies: 3 complete copies spread across nested folders
  • TV Shows: 2 copies plus a third partial
  • Photos: Thankfully untouched
  • Backups: One folder that had been renamed 7 times

Timeline reconstruction suggests this happened over multiple "reorganizing" sessions. Each one created a new folder, moved things, then got interrupted. Classic.

The Cleanup

Wrote a script to identify duplicates by checksum:

# md5 hashes are 32 hex chars, so -w 32 compares on the checksum only.
# -D (not -d) prints every member of each duplicate group, so no copy is missed.
find /mnt/user/media -type f -exec md5sum {} \; | \
  sort | uniq -D -w 32 > duplicates.txt

Result: 4.2TB of duplicate files.

After careful verification (and a backup snapshot), started the cleanup. Flattened folder structure. Removed verified duplicates. Created .no-touchy marker files in critical directories.

Called dad. Explained the situation diplomatically.

"I didn't do anything, the folders just... appeared."

Sure, dad. The folders spontaneously generated themselves.

Set up proper folder permissions: read-only for browsing, write access only through proper interfaces.
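On a plain Linux box the same lockdown is a couple of find passes. A minimal sketch, using a throwaway directory instead of the live share (paths and the demo tree are stand-ins, not the actual Unraid layout):

```shell
# Read-only browsing for everyone; writes stay with the owning account.
# MEDIA_ROOT is a demo tree, not the real /mnt/user/media share.
MEDIA_ROOT=$(mktemp -d)
mkdir -p "$MEDIA_ROOT/movies"
touch "$MEDIA_ROOT/movies/film.mkv"

find "$MEDIA_ROOT" -type d -exec chmod 755 {} +   # dirs: owner rwx, others r-x
find "$MEDIA_ROOT" -type f -exec chmod 644 {} +   # files: owner rw, others r
touch "$MEDIA_ROOT/.no-touchy"                    # marker: layout is intentional
```

On Unraid itself the share-level permissions in the GUI are the cleaner tool; a chmod pass like this is just the belt-and-braces layer underneath.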

Result: Recovered 4.2TB. Array now at 45% instead of 89%.


Day 2: Building the Command Center

Because clicking through 15 different UIs was getting old.

We had dashboards everywhere:

  • Proxmox web UI on Izar-Orchestrator
  • Proxmox web UI on Tarn-Host
  • Unraid dashboard on Meridian-Mako
  • Synology DSM on Cassiel-Silo
  • Individual container UIs...
  • A terminal running htop
  • A glances web UI on another monitor

I was drowning in tabs.

With infrastructure spanning the local network (10.42.0.x) and a remote network (192.168.20.x), I needed a Single Pane of Glass. A real dashboard.

Enter Homepage.

The Stack

  • Dashboard: Homepage running in Docker on Altair-Link
  • Monitoring: Glances on each host for system stats
  • Status: Uptime Kuma for service health

The Glances Version Wrinkle

This one got me. Proxmox hosts run Glances v3 (from Debian repos). Everything else runs v4. Homepage supports both, but you have to tell it which one:

- Old Proxmox Host:
    widget:
        type: glances
        url: http://10.42.0.200:61208
        version: 3  # <--- The magic number

- New Proxmox Host:
    widget:
        type: glances
        url: http://192.168.20.100:61208
        version: 4

Spent 30 minutes debugging "why isn't Izar-Orchestrator showing up" before realizing the version mismatch. Homepage fails silently on version conflicts. No error. Just... no data.

Cross-Site Monitoring

The magic: Tailscale mesh VPN.

Altair-Link (Milky Way network) can see Tarn-Host (Andromeda network) with ~40ms latency. The dashboard pulls stats from both sites as if they were local.

Visualizing the Build Cluster

The Gentoo build cluster is the heart of the system. Workers scattered across both networks — LXC containers on both Proxmox hosts, Docker containers on Unraid, a dedicated test VM.

I wanted to see them all side-by-side. Initially dumped them in a list, but the widgets were huge. Switched to a matrix layout:

layout:
  Build Workers:
    style: row
    columns: 4

Now I can see CPU load across the entire cluster in one glance. When a compile job hits, the whole row lights up red. Beautiful.

Fixing Synology Sensors

One last annoyance: the Synology NAS units weren't showing disk I/O in the Glances widget. Glances was reporting the filesystem as /volume1 and the RAID device as md2, but Homepage's disk widget wanted a physical disk identifier.

- Synology NAS:
    widget:
        type: glances
        url: http://192.168.20.7:61208
        disk: sda  # Not md2 or /volume1
        version: 4

Suddenly, the read/write graphs came alive.

Service Categories

Build Infrastructure — Gateway status, orchestrator status, drone fleet health

Hypervisors — Tarn-Host (Proxmox), Izar-Orchestrator (Proxmox)

Storage — Meridian-Mako-Silo (Unraid), Cassiel-Silo (Synology)

Media — Plex transcoder status, Tautulli activity

One URL: http://10.42.0.199:3001. All systems. Both sites. Real-time stats.

Homepage Tips

Version conflicts: Always check which Glances version your hosts run. Set it explicitly.

Cross-network routing: If your dashboard container can't reach a host, it fails silently. Test with curl from inside the container first.

Layout density: Use columns: N and style: row to pack more widgets horizontally.

Disk sensors: Synology and Linux software RAID expose logical devices (md2, /volume1) rather than the physical disk Homepage's widget expects. Check glances --stdout diskio on the host to find the right identifier.


Day 2.5: The Power Crisis

15:42 — Power fluctuation at the Andromeda site. Not a full outage — just enough to trip the UPS and cause everything to reboot.

Impact

Down:

  • Tarn-Host (Proxmox hypervisor)
  • All VMs on Tarn-Host (including drone-Tarn, orchestrator-standby)
  • Meridian-Mako-Silo (Unraid)
  • Cassiel-Silo (Synology)
  • Media services (Plex, etc.)

Unaffected:

  • Milky Way site (local network)
  • Build swarm (degraded but functional)
  • Primary workstation

Recovery Timeline

15:42 — Alert from Uptime Kuma: "Tarn-Host: DOWN"

15:43 — Build swarm self-healing kicks in. Detects drone-Tarn offline. Reclaims 8 delegated jobs. Redirects work to remaining drones.

15:45 — Attempted SSH to Andromeda hosts: timeout

15:50 — Called dad: "Did the power flicker?" / "Yeah, just for a second"

16:05 — Tarn-Host back online (auto-start after power restore). 4 VMs starting simultaneously.

16:12 — All services restored. Total outage: 30 minutes.

What Worked

  1. Uptime Kuma alerts — Knew immediately something was wrong
  2. Build swarm self-healing — No manual intervention needed for builds
  3. Proxmox auto-start — VMs came back on their own
  4. That new dashboard — Could see the recovery in real-time

What Didn't

  1. No automated notification to dad — Had to call manually
  2. VM boot order not optimized — Resource contention during simultaneous start
  3. No remote power control — Couldn't force restart anything remotely

Day 3: The Log Flood

The morning after the power crisis. Monitoring showed Meridian-Mako-Silo's log partition at 94% capacity.

For Unraid, where /var/log lives in RAM, this is deadly. Hit 100% and services crash, the GUI dies, and you're looking at a hard reboot.

The Detective Work

SSH'd in through the Proxmox jump box and ran du -sh /var/log/*. Nothing huge showed up.

That's when I remembered ghost files.

lsof +L1 /var/log

There it was. rsyslogd was holding a deleted file handle. It had grown to 112MB — massive for a RAM disk.

But why?

I dug into the active logs and found a scream of errors:

emhttpd: error: getxattr on /mnt/user/synology_mount: Operation not supported

The Synology NAS units are mounted via rclone into the main /mnt/user tree. Unraid's management daemon (emhttpd) constantly scans /mnt/user. It tries to read extended attributes (xattrs). Rclone mounts don't support xattrs.

Result: 172,800 errors per day. Every single one logged. Every single one filling the RAM disk.

The Fix

First, stop the bleeding:

kill -HUP $(cat /var/run/rsyslogd.pid)

This forced rsyslog to release the deleted file handle. Usage dropped from 94% to 6%.

Second, apply the filter. I couldn't move the mount points without breaking containers, so I told rsyslog to ignore the noise:

# /boot/config/rsyslog.conf
if $programname == "emhttpd" and $msg contains "synology" then stop

Silence is golden.

Lesson learned: Ghost logs are real. Always check lsof +L1 when disk space vanishes but du shows nothing.
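The mechanism is easy to reproduce on any Linux box: hold a file open, delete it, and the kernel keeps the bytes alive until the handle closes. A minimal demo:

```shell
# A "ghost" file: deleted from the directory tree, still consuming space.
ghost=$(mktemp)
exec 3>"$ghost"          # hold an open write handle on fd 3
echo "still logging" >&3
rm "$ghost"              # du and ls no longer see it...
readlink /proc/$$/fd/3   # ...but the kernel reports "<path> (deleted)"
exec 3>&-                # only closing the handle frees the space
```

lsof +L1 is essentially a friendlier interface over the same /proc/<pid>/fd data.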


Day 4: The Network Ghost

Day after the log flood fix. Meridian-Mako-Silo was online according to Proxmox, but:

  • Web UI: unreachable
  • SSH: connection refused
  • Plex: offline
  • NFS shares: timeout

The server was running but invisible to the network.

Remote Diagnosis

Couldn't SSH directly, but could access via Proxmox console passthrough:

  1. Logged into Tarn-Host Proxmox
  2. Opened Unraid VM console
  3. Found the server sitting at a login prompt, looking innocent

Logged in locally. Network interfaces were... interesting.

ip addr show
# eth0: NO-CARRIER
# br0: state DOWN

The bridge interface had decided not to start properly.

Root Cause

During the power event, the VM started before the virtual network switch was fully initialized. Unraid grabbed an interface that didn't exist yet, and never retried.

The Phone Call

Had to guide dad through opening the Proxmox console. Over the phone. While he was watching a football game.

"Click on the VM name. No, the name. The text. Yes. Now click Console. The button that says Console. No, not that one..."

We got there eventually.

Prevention

Added startup delay to the Unraid VM in Proxmox:

Start/Shutdown order: 3
Startup delay: 60
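The same settings can be applied from the Proxmox shell with qm set; the VMID below is a placeholder, not the actual Unraid VM's ID:

```shell
# order= sets the boot sequence; up= waits N seconds after this VM starts
# before Proxmox boots the next one, spreading out the startup I/O storm.
if command -v qm >/dev/null 2>&1; then
    qm set 100 --startup "order=3,up=60"   # 100 is a placeholder VMID
else
    echo "qm not found (not a Proxmox host)"
fi
```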

Also configured Unraid to be more aggressive about network recovery in the startup script.


Day 5: The Thunderbolt Incident

Because things were going too smoothly.

I wanted faster transfer speeds for the Unraid box. I had a spare Thunderbolt 10GbE adapter. I plugged it in.

IMMEDIATELY, the server went dark.

  • No Web UI
  • No SSH
  • No Docker services
  • Ping worked... sometimes?

I unplugged the adapter. It didn't come back.

Safe Mode Scramble

Booted into Safe Mode (no plugins, no Docker) to get access. Something had corrupted the network stack.

It turned out to be a perfect storm of three separate pre-existing failures that the Thunderbolt adapter exposed:

1. The Boot Script

My /boot/config/go file (Unraid's startup script) had syntax errors. Missing then keywords, extra fi blocks. It had been failing silently for who knows how long. The server worked despite the script, not because of it.
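A habit that would have caught this years earlier: parse the script without executing it. bash -n does a syntax-only pass; a sketch, assuming the script lives at Unraid's usual /boot/config/go path:

```shell
# Syntax-only check; catches the missing-then / stray-fi class of errors
# without running a single command from the script.
script="${GO_SCRIPT:-/boot/config/go}"
if [ ! -f "$script" ]; then
    echo "no such script: $script"
elif bash -n "$script"; then
    echo "syntax OK: $script"
else
    echo "SYNTAX ERROR in $script, fix before rebooting"
fi
```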

2. Docker Config

I had told Docker to use wlan0 (WiFi) as a custom network at some point. The WiFi interface was down. Docker refused to start, dragging dependent services down with it.

3. The Tailscale Routing Trap

This was the kicker.

The Proxmox hypervisor acts as a subnet router, advertising 192.168.20.0/24 to the Tailscale network. When the Unraid server connected to Tailscale, it learned the route to its own subnet via Tailscale.

So when I tried to SSH from the hypervisor (192.168.20.100) to Unraid (192.168.20.50):

  1. Packet goes Hypervisor → Unraid (Local LAN) ✓
  2. Unraid replies... but thinks the fastest path to 192.168.20.0/24 is via tailscale0!
  3. Packet goes Unraid → Tailscale relay → Hypervisor
  4. Hypervisor drops it because of asymmetric routing

The Resolution

Taught Unraid to ignore its own tail:

ip rule add to 192.168.20.0/24 lookup main priority 5200

This forces traffic destined for the local LAN to use the main routing table (eth0/br0), ignoring Tailscale's magic routes.
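To confirm the rule took, read the policy table back; lower priority numbers win, and 5200 sorts ahead of the 52xx-range rules Tailscale adds on Linux. Guarded so it degrades gracefully on a box without iproute2:

```shell
# Show the policy routing table; the priority 5200 entry must appear
# above (lower number than) the rules Tailscale installs.
if command -v ip >/dev/null 2>&1; then
    ip rule show
    # Ask the kernel which path a LAN destination would actually take:
    ip route get 192.168.20.100 2>/dev/null \
        || echo "destination not routable from this host"
else
    echo "iproute2 not available"
fi
```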

Lesson: Verify your boot scripts. A script with a syntax error might not run anything after the error. No warnings. Just silent failure. And Thunderbolt on Unraid? Cursed.


Days 8-10: The Audit and Documentation Sprint

The dust had settled. Now came the boring part. The essential part.

I'd learned something from the chaos: "it's probably fine" is not a status. I needed to verify every service, manually, with my own eyes.

The Full Audit

Saturday morning. Coffee. A blank checklist.

Build Infrastructure:

Service        Status       Notes
Gateway        Responding   API returns valid JSON
Orchestrator   Processing   Queue depth: 3 packages
Drones (5/5)   Online       All reporting heartbeats
Binhost        Serving      SSH connection stable
Client sync    Working      Pulled test package successfully

The swarm had survived the power event. Self-healing kicked in exactly as designed. When the Andromeda drones came back online, they automatically rejoined the collective without intervention. That felt good.

Media Services:

Service         Status       Notes
Plex (local)    Streaming    Tested playback on three clients
Plex (remote)   Streaming    Tailscale latency: 42ms
Tautulli        Collecting   Stats updating in real-time
Overseerr       Processing   Test request fulfilled

Everything worked. But Tautulli had stale session data — ghosts of streams that ended days ago. A restart cleared it.

Storage and Network: Green across the board. All drives healthy, Tailscale mesh connected, DNS resolving, VPN stable for 48+ hours.

The Minor Issues

Not everything was perfect:

  • Tautulli session ghosts — Sessions active when Tarn-Host went down were never properly closed
  • Backup schedule drift — One job at 2 AM competed with Portage sync for disk I/O. Moved to 3 AM
  • DNS record stale — An old A record pointed to an IP that hadn't existed for months

None would have caused an outage. All would have caused confusion during the next incident.

The Documentation Sprint

Auditing revealed the real gap: not in services, but in documentation.

When I was guiding dad through Proxmox over the phone, I realized: there was no "click here, then here, then here" guide. I was improvising from memory while he was watching football.

Never again.

What I wrote:

Network Topology Diagram — Not just IPs. A full visual: both sites, subnets, hypervisors vs VMs vs containers, Tailscale subnet router paths, the exact route a packet takes from my desktop to dad's Plex server.

Service Catalog — Every running service: what it does, what depends on it, where the config lives, how to restart it, what "broken" looks like.

Recovery Runbooks — Step-by-step procedures:

  • "Site is completely down" (check power, then UPS, then network, then hypervisor)
  • "Service X isn't responding" (flowchart by service type)
  • "How to add a new service" (the right way, not the fast way)
  • "How to explain to dad what broke" (simplified terminology, no jargon)

That last one is real. When I'm debugging at 11 PM and need dad to check a blinking light, I can't use words like "hypervisor" or "LXC container." The runbook has translations.

Emergency Contact Procedures — Printed. Laminated. Taped inside dad's network cabinet:

  1. Check if it's just the internet (can you load Google?)
  2. Check if Tailscale is connected (green icon in system tray)
  3. If both are fine, the problem is probably on my end — wait
  4. If the internet is down, restart the router (unplug, wait 30 seconds, plug back in)
  5. If that doesn't work, call the ISP

Cleanup Tasks

While documenting, found cruft:

  • Orphaned containers — Three Docker containers that hadn't run in months. Removed
  • Configuration archaeology — Files named traefik.yml.backup.old.working.FINAL. Archived to a dated folder
  • Naming inconsistencies — Some configs said "titan," others said "Tarn-Host." Standardized everything
  • DNS debt — Records for services that no longer existed. Cleaned

Maintenance Automation

Tasks I was doing manually every week became scripts:

  • Check certificate expiration dates
  • Verify backups actually ran
  • Check disk space on all nodes
  • Rotate logs before they fill drives

They run via cron. They report to Uptime Kuma. If a certificate is expiring in < 14 days, I get an alert. If a backup hasn't run in > 36 hours, alert. Any disk > 85% full, alert.
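The disk-space job is the simplest of the four. A trimmed sketch; the threshold, and the push URL in the comment, are examples rather than the real monitor:

```shell
# Flag any real filesystem above the threshold; cron runs this hourly.
THRESHOLD=85
df -P -x tmpfs -x devtmpfs | awk -v t="$THRESHOLD" 'NR > 1 {
    use = $5; sub(/%/, "", use)                    # "91%" -> 91
    if (use + 0 > t) printf "ALERT %s at %s%%\n", $6, use
}'
# A non-empty result flips an Uptime Kuma push monitor, e.g.:
#   curl -fsS "https://kuma.example/api/push/<token>?status=down&msg=disk"
```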

Synthetic Health Checks

The dashboard shows if services are "up." But "up" doesn't mean "working."

A container can be running while the application inside has crashed. A web server can respond to health checks while returning 500 errors to users.

Added synthetic checks:

  • Plex: Every 5 minutes, try to play a 10-second test video. If it fails, alert.
  • Build swarm: Every 5 minutes, query the gateway API. If it returns invalid JSON or times out, alert.
  • NAS: Every 5 minutes, list a test directory. If it fails or times out, alert.

These catch the "it's running but broken" failures that status pages miss.
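The build-swarm check, sketched below; the URL is a placeholder for the real gateway endpoint. The point is to parse the response, not just count a 200:

```shell
# "Up" means: answers within 5 seconds with JSON that actually parses.
check_gateway() {
    curl -fsS --max-time 5 "$1" | python3 -m json.tool >/dev/null 2>&1
}

if check_gateway "http://10.42.0.50:8080/api/status"; then   # placeholder URL
    echo "gateway: healthy"
else
    echo "gateway: ALERT"    # feeds the Uptime Kuma alert in the real job
fi
```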


The Scorecard

Metric                       Before   After
Services documented          ~40%     100%
Recovery runbooks            0        4
Orphaned containers          3        0
Stale DNS records            7        0
Automated health checks      0        12
Laminated emergency guides   0        1

What I Learned

Monitoring saves you. Having the dashboard up during the power crisis meant I could see recovery happening in real-time instead of wondering what was broken.

Automation handles the common cases. The build swarm self-healed without intervention. That freed me to focus on the uncommon case (network ghost).

Ghost logs are real. du lies when a deleted file has an open handle. lsof +L1 is your friend.

Remote access has limits. When you need to guide someone through console access over the phone, every extra step is painful. Simplify recovery procedures.

Infrastructure isn't done when it works. It's done when someone else could run it. When future-me won't curse past-me. When the documentation matches reality. When recovery doesn't require heroics.

Dad will reorganize folders. Accept this. Implement permissions.

Ten days. From "we should have better monitoring" to "thank god we have better monitoring." The swarm is stable. The documentation is current. The runbooks are written.

And next time something breaks at halftime, dad knows exactly where to click.

System: maintainable. Finally.