user@argobox:~/journal/2026-03-07-forty-two-files-one-saturday
$ cat entry.md

Forty-Two Files, One Saturday


The Motivation

I ran a full site audit with Claude — 511 checks. 441 passed, 44 warnings, 26 failures. Some of the failures were embarrassing — like the user settings page that had zero authentication. Just... open. Anyone could hit /user/settings and it'd render.

But the pattern that really got to me was error messages. Almost every API endpoint in ArgoBox was returning error.message directly to the client, which means that when something breaks, the user sees the internal error string. File paths. Database connection details. Internal service URLs. Stack traces in the middleware's 500 handler.

It's the kind of thing that works fine until someone's poking at your API and you're handing them a map of your internal architecture for free.

Phase 1: Error Sanitization

Started with the public-facing endpoints. Every catch block that returned error.message got replaced with a generic "Internal error" response, with the real error logged server-side only.
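The pattern looks roughly like this. A minimal sketch, not ArgoBox's actual handler — the handler name and the injected fetch function are illustrative:

```typescript
// Shape of a sanitized endpoint response (illustrative, not the real ArgoBox type).
type ApiResult = { status: number; body: { error?: string; data?: unknown } };

// The fix: log the real error server-side, return only a generic message.
async function handleDeployments(
  fetchDeployments: () => Promise<unknown>
): Promise<ApiResult> {
  try {
    const data = await fetchDeployments();
    return { status: 200, body: { data } };
  } catch (err) {
    // Server-side log keeps the detail; the client never sees err.message.
    console.error("[deployments]", err);
    return { status: 500, body: { error: "Internal error" } };
  }
}
```

The real error still lands in the logs, so debugging doesn't get harder — only the client-facing surface changes.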

Then the admin proxy handlers — 9 of them. Same pattern. The proxy endpoints for the build swarm, the job API, and the swarm admin were all including the internal URL they were trying to reach in the error response. "Failed to connect to http://10.0.0.xxx:8585" — cool, now the client knows my internal IP and port layout.

The middleware's 500 handler was the worst. Full stack trace, sent to the client, in production. Fixed that to only show traces in development mode.
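The dev-only gate is a one-liner. A sketch with an assumed isDev flag and handler shape, not the actual middleware code:

```typescript
// 500 handler: full trace locally, generic message in production.
function render500(err: Error, isDev: boolean): { status: number; body: string } {
  const body = isDev
    ? `Internal error\n${err.stack ?? err.message}` // trace for local debugging
    : "Internal error";                             // nothing leaks in production
  return { status: 500, body };
}
```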

10 files, first commit.

Phase 2: Consistency and Infrastructure

While I was in there, I kept finding things that weren't security bugs but were just... wrong.

The API keys endpoint returned 400 when the encryption key was missing. That's a server configuration problem, not a client error. Changed it to 503.

The ingest endpoint returned { available: false } with a 200 status when the daemon was down. A 200 means "everything's fine." The daemon being unavailable is not fine. Changed it to 503.

Both knowledge endpoints had no input bounds. You could request limit=999999 and the server would try to comply. Added bounds — 1 to 100 for knowledge, 1 to 200 for ingest.
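The bounds check is a small clamp. A sketch assuming the limit arrives as a raw query string; the parameter names are illustrative, the 1–100 / 1–200 ranges come from the actual fix:

```typescript
// Parse a ?limit= query param and clamp it into [1, max].
// Non-numeric or missing input falls back to a sane default.
function clampLimit(raw: string | null, fallback: number, max: number): number {
  const n = Number.parseInt(raw ?? "", 10);
  if (Number.isNaN(n)) return fallback;
  return Math.min(Math.max(n, 1), max);
}
```

So limit=999999 against the knowledge endpoint becomes 100 instead of a full-table scan.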

Then I found the KV cache bug. In runtime-env.ts, the _kvKeysLoaded flag was being set to true before the KV read completed. So if the read failed, the flag was already set and it would never retry. If the read succeeded, great, but the success path wasn't special. The flag should only flip on success. Subtle but real — and it meant a transient KV failure during startup would permanently skip the cache.
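The fixed load path, sketched with a hypothetical kvGetKeys stand-in for the real KV call in runtime-env.ts:

```typescript
// Module-level cache state, as in runtime-env.ts (names simplified).
let _kvKeys: Record<string, string> | null = null;
let _kvKeysLoaded = false;

async function loadKvKeys(
  kvGetKeys: () => Promise<Record<string, string>>
): Promise<Record<string, string> | null> {
  if (_kvKeysLoaded) return _kvKeys;
  try {
    _kvKeys = await kvGetKeys();
    _kvKeysLoaded = true; // the flag flips ONLY on success, so a failed read retries
  } catch (err) {
    console.error("[runtime-env] KV read failed, will retry next call", err);
  }
  return _kvKeys;
}
```

The bug was setting _kvKeysLoaded before the await; moving it after means a transient KV failure during startup no longer poisons the cache forever.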

Also built a shared api-response.ts utility. Every endpoint was writing its own JSON response shape. Now they all use apiError() and apiSuccess() with consistent structure.
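A minimal sketch of what that utility might look like — apiError() and apiSuccess() are the real names, but the exact response fields here are assumptions:

```typescript
// Shared JSON response helpers so every endpoint emits the same shape.
function apiSuccess<T>(data: T, status = 200): Response {
  return new Response(JSON.stringify({ ok: true, data }), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}

function apiError(message: string, status = 500): Response {
  return new Response(JSON.stringify({ ok: false, error: message }), {
    status,
    headers: { "Content-Type": "application/json" },
  });
}
```

Centralizing the shape also means the "never echo internals" rule lives in one place instead of forty.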

5 files, plus the new utility.

Phase 3: RAG Pipeline

Took a hard look at the RAG system while I was in code-review mode.

The chunker is solid — paragraph-aware with 400-word chunks and 80-word overlap. BM25 search through FTS5 is excellent. But the embedder was calling Ollama one chunk at a time instead of using the batch /api/embed endpoint. And vector search was brute-force O(n) — no approximate nearest neighbor index.

Fixed the embedder to batch embeddings. Added a composite index on the chunks table for faster vector scans. And added dimension validation in setEmbeddings() — previously, if you switched embedding models and the dimensions didn't match, it would silently store mismatched vectors. Now it throws.
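Roughly, the batched call plus the guard look like this. A sketch assuming Ollama's /api/embed endpoint (which accepts an array input and returns one embedding per item); the model name, URL, and function names are illustrative:

```typescript
// One HTTP round trip for the whole batch instead of one per chunk.
async function embedBatch(
  texts: string[],
  model = "nomic-embed-text" // assumed model, not necessarily ArgoBox's
): Promise<number[][]> {
  const res = await fetch("http://localhost:11434/api/embed", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model, input: texts }),
  });
  if (!res.ok) throw new Error(`embed request failed: ${res.status}`);
  const { embeddings } = (await res.json()) as { embeddings: number[][] };
  return embeddings;
}

// The setEmbeddings() guard: throw on a dimension mismatch instead of
// silently storing vectors from a different embedding model.
function assertDimensions(vectors: number[][], expected: number): void {
  for (const v of vectors) {
    if (v.length !== expected) {
      throw new Error(`embedding dimension ${v.length} != expected ${expected}`);
    }
  }
}
```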

The ANN index is a bigger project. Would need sqlite-vss or an external library. Left that for later.

Phase 4: The Remaining 20

After the first three commits I realized I'd only covered about half the admin API handlers. Went through the remaining 20 endpoints. Same pattern — replace error.message with generic responses, use the shared utility, log real errors server-side.

42 files total across 3 commits.

The Audit Results

That audit I mentioned? The 26 failures are now mostly addressed:

  • Settings page has an auth gate
  • Error messages don't leak internals
  • Status codes are correct
  • Input bounds are enforced
  • KV cache actually retries on failure

There's still work to do — admin rate limiting, module config gaps, persistent ingestion state. But the surface area where my API was actively helping attackers understand my infrastructure? That's closed.

One More Thing

While reviewing the admin panel, I found that a page using Astro's prerender = true was bypassing Cloudflare Access auth entirely. Prerendered pages get served as static HTML — they don't go through the middleware where the auth check lives. So the page was just... publicly accessible.

The fix was simple: don't prerender pages that need auth. But the fact that it was there at all means the Astro/Cloudflare interaction has edge cases that aren't obvious. Prerender is a performance optimization. It shouldn't also be a security bypass. But in this architecture, it is.
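In Astro terms, the fix is one exported constant in the page's frontmatter. A sketch, assuming middleware-based auth like Cloudflare Access:

```typescript
// Admin page frontmatter: force server rendering so every request passes
// through the middleware where the auth check lives. With prerender = true,
// the page is baked to static HTML at build time and middleware never runs.
export const prerender = false; // explicit, so a config default can't flip it
```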

Wrote that one down in big letters.