
The Silent Monitoring Problem


55 Monitors, Zero Alerts

I've been running Uptime Kuma for months. 55 monitors checking everything — public websites, internal services, Docker containers, Proxmox hosts, NAS devices. Green dots, red dots, uptime percentages. Looks great on the dashboard.

One problem: there were zero notification channels configured. No Telegram alerts. No email. No webhooks. Nothing.

Every time a service went down, Uptime Kuma faithfully recorded it, calculated the downtime, updated the status page... and told nobody. I'd only find out when I happened to glance at the dashboard or when someone mentioned something wasn't working.

That's not monitoring. That's a history book.

Setting Up Alerts

Added a Telegram notification channel. Linked it to 20 critical monitors — the public-facing stuff (the main site, blog, Gitea, status page, the playground), plus key infrastructure (the Docker host, Proxmox nodes, both NAS devices).

Not all 55. Some monitors are informational — internal tools that I check manually, experimental services that I expect to be flaky. Alerting on everything leads to alert fatigue, and alert fatigue leads to ignoring alerts, which puts me right back where I started.
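The routing rule above can be sketched as a tiny predicate. This is a hypothetical illustration, not the actual Uptime Kuma configuration; the monitor names and tags are invented.

```python
# Hypothetical sketch of the alert-routing rule: critical monitors alert,
# informational/experimental ones stay dashboard-only.

CRITICAL_TAGS = {"public", "infra"}                 # these get Telegram alerts
INFORMATIONAL_TAGS = {"internal", "experimental"}   # checked manually

def should_alert(monitor: dict) -> bool:
    """A monitor alerts only if it carries a critical tag and no tag
    that marks it as expected-to-be-flaky."""
    tags = set(monitor.get("tags", []))
    if tags & INFORMATIONAL_TAGS:
        return False
    return bool(tags & CRITICAL_TAGS)

monitors = [
    {"name": "main-site", "tags": ["public"]},
    {"name": "proxmox-izar", "tags": ["infra"]},
    {"name": "toy-service", "tags": ["experimental"]},
]
alerting = [m["name"] for m in monitors if should_alert(m)]
```

The point of encoding the rule rather than hand-picking monitors: when monitor 56 gets added, the decision is already made.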

The Five-Day Ghost

While reviewing monitor data, I noticed Hypervisor Izar had been red for five days. March 5 through March 9. The status page showed 30.9% uptime for the week.

My first thought: false positive. Maybe the health check was misconfigured. Checked the heartbeat data — 8,937 consecutive "Request timeout" entries on port 8006. That's not a flaky check. That's a machine that was genuinely unreachable for five days.
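One way to formalize the "flaky check vs. real outage" distinction is the longest run of consecutive failed heartbeats. A minimal sketch, with invented heartbeat data:

```python
# A flaky check produces isolated failures; an outage produces one long
# unbroken streak. Heartbeat values here are illustrative.

def longest_failure_streak(heartbeats: list[bool]) -> int:
    """Return the longest run of consecutive False (failed) heartbeats."""
    longest = current = 0
    for ok in heartbeats:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest

flaky = [True, False, True, True, False, True]   # isolated blips
outage = [True] + [False] * 8937                 # one long unbroken streak
```

A streak of 1 or 2 is noise; a streak in the thousands is a machine that is down.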

It came back online on March 10. Responding fine now. But five days of downtime on a Proxmox host that runs VMs and containers... and I didn't know because my monitoring was silent.

That's the exact scenario monitoring is supposed to prevent. And it would have, if I'd configured a single notification channel at any point in the last few months.

OpenClaw: From 63 Calls to 9

Separate problem, same day.

OpenClaw was running 13 cron jobs making roughly 63 LLM calls per day across free-tier providers — Groq, Cerebras, SambaNova, OpenRouter. The free tiers have daily token limits. 63 calls a day was bumping against those limits constantly.

Then I restarted the container and wakeMode: "now" fired every single job simultaneously. All 13 jobs, all hitting the same providers at the same moment. Rate limits hit across the board. Fallback chains kicked in. OpenRouter started returning 400 errors because the gateway was sending malformed retry requests — an upstream bug that only surfaces under load.

The cascade looked like this: Groq exhausted → Cerebras exhausted → SambaNova exhausted → OpenRouter malformed → NVIDIA NIM rate-limited. Five providers, all hit within seconds.

The fix was boring: reduce the job count. I rewrote the cron schedule from 13 jobs down to 6:

  • Morning status check (daily, 8 AM)
  • Site and services health (every 8 hours)
  • Playground availability (every 6 hours)
  • Security audit (daily, 11 PM)
  • Weekly report (Sundays, 9 AM)
  • Integration connectivity test (Wednesdays, 10 AM)
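In crontab syntax, the six jobs above would look something like this. The schedule expressions follow from the stated frequencies; the command names are invented for illustration:

```cron
# Hypothetical crontab rendering of the six jobs (commands are placeholders)
0 8    * * *   openclaw run morning-status      # daily, 8 AM (1 call)
0 */8  * * *   openclaw run site-health         # 00:00, 08:00, 16:00 (3 calls)
0 */6  * * *   openclaw run playground-check    # every 6 hours (4 calls)
0 23   * * *   openclaw run security-audit      # daily, 11 PM (1 call)
0 9    * * 0   openclaw run weekly-report       # Sundays, 9 AM
0 10   * * 3   openclaw run integration-test    # Wednesdays, 10 AM
# typical day: 1 + 3 + 4 + 1 = 9 calls
```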

Nine LLM calls a day instead of 63. The disabled jobs were redundant anyway — content scouts and competitive monitors and watchdogs that overlapped with what Uptime Kuma already does better.

I also expanded every agent tier to 5+ fallback providers. If Groq is down, try Cerebras, then SambaNova, then OpenRouter with Gemini, then Qwen, then NVIDIA NIM. Deep enough that a single provider outage doesn't cascade.
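A fallback chain like that reduces to a loop over providers in priority order. A minimal sketch; the provider names match the post, but the call and error interfaces are invented:

```python
# Try each provider in order; only fail if the whole chain is exhausted.

class ProviderError(Exception):
    pass

def call_with_fallback(prompt: str, providers: list) -> str:
    errors = []
    for provider in providers:
        try:
            return provider(prompt)
        except ProviderError as exc:
            errors.append(str(exc))
    raise ProviderError("all providers failed: " + "; ".join(errors))

def rate_limited(name):
    """Simulate a provider that is currently exhausted."""
    def call(prompt):
        raise ProviderError(f"{name} rate-limited")
    return call

chain = [rate_limited("groq"), rate_limited("cerebras"),
         lambda p: f"openrouter: {p}"]
result = call_with_fallback("ping", chain)
```

The depth matters more than the cleverness: with five-plus entries per tier, one exhausted provider is a non-event instead of a cascade.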

Status Page: The Split

One more thing. The public status page at argobox.com/status was showing a single uptime percentage that blended public services with homelab infrastructure. So when Hypervisor Izar went down for five days, the public uptime score tanked even though every client-facing service was fine.

That's misleading in both directions: visitors see "97% uptime" and assume the site is unreliable, while a real infrastructure outage gets diluted into a blended number that understates it. The site itself was at 99.9% — it was the homelab Proxmox node dragging the score down.

Split it into two scores. "Public Uptime" only counts client-facing services — the website, blog, Gitea, status page, playground. "Infrastructure" shows the homelab stuff separately. Both visible, but the headline number is now what actually matters to someone visiting the site.

Public Uptime    Infrastructure    Services Online    Issues
    99.9%            97.2%            42/45              0

The build swarm monitors got removed from the status page entirely. Nobody visiting argobox.com needs to know whether my distributed build nodes are healthy. That's internal infrastructure. It has its own dashboard.

The Pattern

Three separate fixes today — Uptime Kuma alerting, OpenClaw cron reduction, status page split — but they're all the same problem: I built the system but didn't wire it to the thing that makes the system useful.

Monitoring without alerts. An AI agent making 63 API calls a day when 9 would do. A status page that mixed public health with internal chaos. All functional. None optimal. The kind of technical debt where everything technically works but nothing works the way it should.

...anyway. The Telegram alerts are live now. Next time Izar goes dark for five days, I'll know within minutes instead of discovering it by accident while debugging something else.