Four days. Storage archaeology. Dashboard building. A power crisis. Teaching dad to use Proxmox over the phone. This is the full story of building proper monitoring infrastructure—and immediately needing it.
Day 1: The Great Storage Cleanup
Remote monitoring showed Meridian-Mako-Silo (the Unraid server at the Andromeda site) was running low on space. 62TB array, supposedly plenty of room. Something was wrong.
SSH’d in and started exploring. Found the issue within minutes:
/mnt/user/media/movies/New Folder/New Folder/New Folder/...
Forty. Levels. Deep.
Each level contained partial copies of the same files. The array wasn’t running out of space—it was drowning in accidental duplicates.
The Archaeology
Spent two hours tracing the damage:
- Movies: 3 complete copies spread across nested folders
- TV Shows: 2 copies plus a third partial
- Photos: Thankfully untouched
- Backups: One folder that had been renamed 7 times
Timeline reconstruction suggests this happened over multiple “reorganizing” sessions. Each created a new folder, moved things, then got interrupted.
The Cleanup
Wrote a script to identify duplicates by checksum:
find /mnt/user/media -type f -exec md5sum {} \; | \
sort | uniq -d -w 32 > duplicates.txt
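One caveat if you reuse this: uniq -d prints a single line per group of matching checksums, which proves duplicates exist but hides the other copies. GNU uniq can list every member of each group instead:

find /mnt/user/media -type f -exec md5sum {} \; | \
    sort | uniq -w 32 --all-repeated=separate > duplicate-groups.txt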
Result: 4.2TB of duplicate files.
After careful verification (and a backup snapshot), started the cleanup. Flattened folder structure. Removed verified duplicates. Created .no-touchy marker files in critical directories.
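For the removal itself, a quarantine-first loop is the reversible way to go. A minimal sketch, assuming a hand-reviewed list of absolute paths in dupes-to-remove.txt (the list name and quarantine path are placeholders):

QUARANTINE=/mnt/user/quarantine
while IFS= read -r f; do
    dest="$QUARANTINE/${f#/mnt/user/media/}"
    mkdir -p "$(dirname "$dest")"   # mirror the original folder layout
    mv -n -- "$f" "$dest"           # -n: never overwrite, just in case
done < dupes-to-remove.txt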
Called dad. Explained the situation diplomatically.
“I didn’t do anything, the folders just… appeared.”
Sure, dad. The folders spontaneously generated themselves.
Set up proper folder permissions: read-only for browsing, write access only through proper interfaces.
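On Unraid that boils down to per-share SMB security settings. What the share effectively becomes, written out as a plain Samba stanza (the service account name is made up for illustration):

[media]
    path = /mnt/user/media
    read only = yes           # browsing is read-only for everyone
    write list = svc-media    # only the account used by the "proper interfaces" can write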
Result: Recovered 4.2TB. Array now at 45% instead of 89%.
Day 2: Building the Command Center
Because clicking through 15 different UIs was getting old.
We had dashboards everywhere:
- Proxmox web UI on Izar-Orchestrator
- Proxmox web UI on Tarn-Host
- Unraid dashboard on Meridian-Mako
- Synology DSM on Cassiel-Silo
- Individual container UIs…
The Stack
- Dashboard: Homepage running in Docker
- Monitoring: Glances on each host for system stats
- Status: Uptime Kuma for service health
- Deployment: Docker Compose on Altair-Link
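Everything deploys from one Compose file on Altair-Link; Glances gets installed on each monitored host rather than in this stack. A trimmed-down sketch (image tags are the upstream defaults, host ports are illustrative):

services:
  homepage:
    image: ghcr.io/gethomepage/homepage:latest
    ports:
      - "3001:3000"             # Homepage listens on 3000 inside the container
    volumes:
      - ./homepage-config:/app/config
    restart: unless-stopped
  uptime-kuma:
    image: louislam/uptime-kuma:1
    ports:
      - "3002:3001"             # Kuma's default 3001 is taken by the dashboard here
    volumes:
      - ./kuma-data:/app/data
    restart: unless-stopped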
The Glances Version Wrinkle
Proxmox hosts run Glances v3 (from Debian repos). Everything else runs v4. Homepage supports both, but you have to specify:
widget:
  type: glances
  version: 4  # or 3 for legacy hosts
Spent 30 minutes debugging “why isn’t Izar-Orchestrator showing up” before realizing the version mismatch.
Cross-Site Monitoring
The magic: Tailscale mesh VPN.
Altair-Link (Milky Way network) can see Tarn-Host (Andromeda network) with ~40ms latency. The dashboard pulls stats from both sites as if they were local.
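In Homepage terms, that just means the widget URLs point at Tailscale addresses instead of LAN ones. One of the Andromeda entries, roughly (the 100.x address is a placeholder; note the version: 3 for the Debian-repo Glances):

- Tarn-Host:
    icon: proxmox.png
    widget:
      type: glances
      url: http://100.101.102.103:61208   # Tailscale IP, Glances' default web port
      version: 3
      metric: info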
Service Categories
Build Infrastructure — Gateway status, orchestrator status, drone fleet health
Hypervisors — Tarn-Host (Proxmox), Izar-Orchestrator (Proxmox)
Storage — Meridian-Mako-Silo (Unraid), Cassiel-Silo (Synology)
Media — Plex transcoder status, Tautulli activity
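The categories map one-to-one onto top-level groups in Homepage's services.yaml. Skeleton only, with placeholder addresses:

- Build Infrastructure:
    - Build Orchestrator:
        href: http://10.42.0.199:8090
- Hypervisors:
    - Tarn-Host:
        href: https://tarn-host.example:8006
- Storage:
    - Meridian-Mako-Silo:
        href: http://meridian-mako.example
- Media:
    - Plex:
        href: http://plex.example:32400/web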
One URL: http://10.42.0.199:3001. All systems. Both sites. Real-time stats.
Day 2.5: The Power Crisis
15:42 — Power fluctuation at the Andromeda site. Not a full outage—just enough to trip the UPS and cause everything to reboot.
Impact
Down:
- Tarn-Host (Proxmox hypervisor)
- All VMs on Tarn-Host (including drone-Tarn, orchestrator-standby)
- Meridian-Mako-Silo (Unraid)
- Cassiel-Silo (Synology)
- Media services (Plex, etc.)
Unaffected:
- Milky Way site (local network)
- Build swarm (degraded but functional)
- Primary workstation
Recovery Timeline
15:42 — Alert from Uptime Kuma: “Tarn-Host: DOWN”
15:43 — Build swarm self-healing kicks in. Detects drone-Tarn offline. Reclaims 8 delegated jobs. Redirects work to remaining drones.
15:45 — Attempted SSH to Andromeda hosts: timeout
15:50 — Called dad: “Did the power flicker?” / “Yeah, just for a second”
16:05 — Tarn-Host back online (auto-start after power restore). 4 VMs starting simultaneously.
16:12 — All services restored.
What Worked
- Uptime Kuma alerts — Knew immediately something was wrong
- Build swarm self-healing — No manual intervention needed for builds
- Proxmox auto-start — VMs came back on their own
- That new dashboard — Could see the recovery in real-time
What Didn’t
- No automated notification to dad — Had to call manually
- VM boot order not optimized — Resource contention during simultaneous start
- No remote power control — Couldn’t force restart anything remotely
Day 3: The Network Ghost
Day after the reboot crisis. Meridian-Mako-Silo was online according to Proxmox, but:
- Web UI: unreachable
- SSH: connection refused
- Plex: offline
- NFS shares: timeout
The server was running but invisible to the network.
Remote Diagnosis
Couldn’t SSH directly, but could access via Proxmox console passthrough:
- Logged into Tarn-Host Proxmox
- Opened Unraid VM console
- Found the server sitting at a login prompt, looking innocent
Logged in locally. Network interfaces were… interesting.
ip addr show
# eth0: NO-CARRIER
# br0: state DOWN
The bridge interface had decided not to start properly.
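Once the host side is healthy, kicking the guest's network stack from the local console is usually enough to recover. Unraid is Slackware-based, so that looks something like:

/etc/rc.d/rc.inet1 restart    # Slackware-style network restart; rebuilds br0
ip addr show br0              # confirm the bridge came up with an address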
Root Cause
During the power event, the VM started before the virtual network switch was fully initialized. Unraid grabbed an interface that didn’t exist yet, and never retried.
The Phone Call
Had to guide dad through opening the Proxmox console. Over the phone. While he was watching a football game.
“Click on the VM name. No, the name. The text. Yes. Now click Console. The button that says Console. No, not that one…”
We got there eventually.
Prevention
Added startup delay to the Unraid VM in Proxmox:
Start/Shutdown order: 3
Startup delay: 60 (seconds)
Also configured Unraid to be more aggressive about network recovery in the startup script.
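"More aggressive" here just means a retry loop in the startup script (/boot/config/go on Unraid). A rough sketch of the shape, with arbitrary timings:

# retry networking if the bridge missed its first attempt
for i in 1 2 3 4 5; do
    ip link show br0 2>/dev/null | grep -q "state UP" && break
    logger "go: br0 not up, restarting networking (attempt $i)"
    /etc/rc.d/rc.inet1 restart
    sleep 15
done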
Day 4: Dashboard Polish
With the crisis resolved, time to make the dashboard actually useful.
New Widgets
Tautulli Integration — Current Plex activity in real-time: active streams, transcoding status, bandwidth usage.
Overseerr Integration — Media request status: pending requests, recently added, processing queue.
Build Swarm Widget — Custom API integration showing drone count, queue depth, build rate.
- Build Swarm:
    icon: package.png
    widget:
      type: customapi
      url: http://10.42.0.199:8090/api/v1/status
      mappings:
        - field: drones_online
          label: Drones
        - field: packages_needed
          label: Queue
        - field: build_rate
          label: Rate
          format: number
          suffix: /hr
The Authentication Challenge
These services are split across several different auth methods:
- Tautulli: API key
- Proxmox: Username + API token
- Overseerr: API key
- Custom APIs: No auth (internal only)
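Homepage at least keeps the secrets out of the YAML: {{HOMEPAGE_VAR_*}} placeholders get substituted from environment variables on the container. The Tautulli entry ends up roughly like this (the host address is a placeholder):

- Tautulli:
    icon: tautulli.png
    href: http://10.42.0.x:8181
    widget:
      type: tautulli
      url: http://10.42.0.x:8181
      key: {{HOMEPAGE_VAR_TAUTULLI_KEY}}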
Created a template generator script to reduce copy-paste errors.
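A trimmed-down sketch of the idea (the script name and arguments are invented for illustration):

#!/bin/bash
# gen-service.sh NAME TYPE URL KEYVAR: print a Homepage service entry to stdout
NAME="$1"; TYPE="$2"; URL="$3"; KEYVAR="$4"

cat <<EOF
    - ${NAME}:
        href: ${URL}
        widget:
          type: ${TYPE}
          url: ${URL}
          key: {{HOMEPAGE_VAR_${KEYVAR}}}
EOF

Run it as ./gen-service.sh Tautulli tautulli http://10.42.0.x:8181 TAUTULLI_KEY and paste the output under the right group.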
What I Learned
Monitoring saves you. Having the dashboard up during the power crisis meant I could see recovery happening in real-time instead of wondering what was broken.
Automation handles the common cases. The build swarm self-healed without intervention. That freed me to focus on the uncommon case (network ghost).
Remote access has limits. When you need to guide someone through console access over the phone, every extra step is painful. Simplify the recovery procedures.
Dad will reorganize folders. Accept this. Implement permissions.
Four days. From “we should have better monitoring” to “thank god we have better monitoring.”
Command center: operational.