From 'What Is Kubernetes?' to Production Containers: The K3s Journey

Update January 2026: This experiment was the seed that grew into the Build Swarm. While I eventually hit the limits of K3s on low-memory hardware (the infamous OOM crisis of January 2026), the lessons learned here about declarative infrastructure were foundational. See How I Solved Gentoo’s Compile Problem for where this journey led.

From “What Is Kubernetes?” to Production Containers

“How hard can Kubernetes be?”

Famous last words. The kind of sentence that, in hindsight, deserves a laugh track. I had a vision: a self-healing, declarative infrastructure where I never had to SSH into a server to restart a service again. No more systemctl restart whatever. No more “did you remember to start that after the reboot?” conversations with myself at midnight.

I also had a constraint: modest hardware. Not a rack of Dell PowerEdge servers. Not a cloud budget. An Intel NUC and a Proxmox VM.

This is the story of 16 months with K3s — Rancher’s lightweight Kubernetes distribution — from the first curl | sh to a production cluster running 14 pods on under 5 gigs of RAM. It includes every wrong turn, every 2 AM debugging session, and the networking bug that cost me two hours of my life over a missing /api suffix.


The Hardware (December 2023)

The ambition was bigger than the hardware. Always is.

  • Master Node: Altair-Link — Intel NUC, 16GB RAM, dual-core. The brains.
  • Worker Node: Arcturus-Prime — VM on Proxmox, 8GB RAM. The muscle. Well, the “muscle.”

8GB of RAM for a Kubernetes worker node. I’ve since learned that Kubernetes people hear that number and wince. At the time, I thought it was plenty. K3s is “lightweight,” right? The marketing says so.

The marketing is technically correct. K3s itself is lightweight. It’s everything you run on top of it that isn’t.

But I’m getting ahead of myself.

The First Experiment

K3s actually lives up to its promises during installation. It’s a single binary. No dependency hell, no “install these 14 prerequisites first” dance. Just a curl command and a prayer.

# On Master (Altair-Link)
curl -sfL https://get.k3s.io | sh -s - server \
  --disable traefik \
  --write-kubeconfig-mode 644

# On Worker (Arcturus-Prime)
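# K3S_TOKEN is the join token, found on the master at /var/lib/rancher/k3s/server/node-token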
curl -sfL https://get.k3s.io | K3S_URL=https://10.42.0.199:6443 K3S_TOKEN=... sh -

Within 30 minutes, I had a functional cluster. Thirty minutes. From zero Kubernetes knowledge to kubectl get nodes showing two Ready nodes. I leaned back in my chair. I felt like a cloud commander. I had orchestration. I had a control plane. I had a worker node awaiting instructions.

This feeling lasted approximately four hours.


The Database Migration Disaster

My first real task was migrating OpenWebUI from a local SQLite database to a proper PostgreSQL instance running inside the cluster. SQLite locks during concurrent writes, which means if two people try to use it at the same time, one of them gets a database lock error and a bad experience. PostgreSQL doesn’t have this problem. Simple enough migration, right?

I wrote my first Deployment manifest. Deployed PostgreSQL. Pointed OpenWebUI at it. It worked. Data flowed. I was a genius.

Then I restarted the pod.

Everything was gone. Every conversation, every setting, every piece of data. Just… gone. The container came back up with a fresh, empty database like nothing had ever happened. Because, technically, nothing had happened — not in any way that persisted beyond the container’s ephemeral filesystem.

This is the PersistentVolumeClaim lesson, and you will learn it exactly once. Kubernetes pods are ephemeral by design. When a pod restarts, its filesystem resets to the container image’s default state. If you want data to survive a restart — and you do, you very much do — you need a PVC. Without one, you’re writing on a whiteboard that gets wiped clean every time the pod restarts.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi

Ten lines of YAML that would have saved me an evening of re-importing data. I added the PVC, mounted it into the PostgreSQL pod, updated the Deployment, and redeployed. This time, the data survived restarts. This time, multiple users could chat simultaneously without SQLite locking up.
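For reference, the mount itself is just a couple of stanzas in the Deployment. A sketch of the relevant pod-template excerpt, assuming the PVC above; the image tag is illustrative, and the mount path is the official postgres image’s default data directory:

# Pod template excerpt from the PostgreSQL Deployment
spec:
  template:
    spec:
      containers:
        - name: postgres
          image: postgres:16                        # any recent PostgreSQL image works
          volumeMounts:
            - name: postgres-data
              mountPath: /var/lib/postgresql/data   # PostgreSQL's default data directory
      volumes:
        - name: postgres-data
          persistentVolumeClaim:
            claimName: postgres-pvc                 # binds the volume to the claim above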

The DATABASE_URL connection string looked like this:

postgresql://user:pass@postgres-service:5432/openwebui

That postgres-service bit is a Kubernetes Service — basically an internal DNS name that routes traffic to the right pod. Elegant, once you know it exists. Invisible and confusing if you don’t.
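For completeness, the Service behind that name is tiny. A minimal sketch, assuming the PostgreSQL pods carry an app: postgres label (the label is my placeholder, not gospel):

apiVersion: v1
kind: Service
metadata:
  name: postgres-service      # becomes the DNS name in the connection string
spec:
  selector:
    app: postgres             # must match the labels on the PostgreSQL pods
  ports:
    - port: 5432              # port clients use via postgres-service
      targetPort: 5432        # port the container actually listens on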

The PVC thing seems obvious in retrospect. It’s always obvious in retrospect. But when you’re coming from Docker, where you just pass -v /host/path:/container/path and call it a day, the Kubernetes abstraction layer catches you off guard. Volumes exist. Claims exist. Storage classes exist. And if you skip any of them, your data lives exactly as long as your pod does.


The Networking Trap

If the PVC lesson cost me an evening, the networking lesson cost me my sanity. Or at least two hours of it on what was supposed to be a quick Saturday afternoon project.

OpenWebUI needed to talk to Ollama (the local inference engine) running on Arcturus-Prime. Simple HTTP request. One service calling another. This is what Kubernetes is for.

Attempt 1: http://localhost:11434

Failed. Of course it failed. But my muscle memory from Docker Compose typed it before my brain caught up. In Kubernetes, each pod has its own network namespace. localhost inside a pod refers to that pod, not the host machine, not any other pod, not the node it’s running on. Localhost means yourself. Always. This is one of those facts that seems trivially obvious when someone tells you, and completely invisible when you’re troubleshooting at 3 PM with the confidence of someone who’s been running containers for two whole days.

Attempt 2: http://10.42.0.100:11434

Progress. Sort of. The connection went through — no more “connection refused.” But I got a 404. Not a timeout, not a connection error. A clean, confident 404. The service was there, it was responding, and it was telling me to go away. This was more confusing than the outright failure, because a 404 means “I hear you, but I don’t know what you want.” The port was right. The IP was right. What was wrong?

Attempt 3: http://10.42.0.100:11434/api

Success.

Two hours. Two hours of debugging, of checking firewall rules, of restarting pods, of reading Ollama documentation, of questioning my career choices. The fix was adding /api to the URL. Six characters. A forward slash and three letters.

Kubernetes networking is 90% of the battle. The other 10% is also networking, but you just haven’t found it yet. Every service has its own IP. Every pod has its own namespace. DNS resolution works differently depending on whether you’re inside or outside the cluster. Port mappings are not port forwards. NodePorts are not the same as ClusterIPs. And localhost never means what your Docker instincts tell you it means.
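One way to avoid hard-coding that node IP into application config is a Service without a selector plus a matching Endpoints object, which gives an out-of-cluster process like Ollama a stable in-cluster DNS name. A sketch, with the IP and names standing in for the real values:

apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  ports:
    - port: 11434             # pods can now call http://ollama:11434/api
---
apiVersion: v1
kind: Endpoints
metadata:
  name: ollama                # must match the Service name exactly
subsets:
  - addresses:
      - ip: 10.42.0.100       # the host running Ollama, outside the cluster
    ports:
      - port: 11434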

I wrote this in my notes that Saturday: “Networking. It’s always networking.” I have not had reason to revise this opinion.


The Memory Leak (Foreshadowing)

December 2023 into January 2024. Everything ran great for about a month. I had PostgreSQL, OpenWebUI, and a handful of smaller services all humming along. kubectl get pods showed healthy green across the board. I’d check the cluster dashboard in the morning with my coffee, see everything running, and feel a quiet satisfaction.

But looking back at my monitoring logs from that period, the warning signs were all there.

Arcturus-Prime — the worker with 8GB RAM — was consistently sitting at 85% memory usage. Not spiking to 85%. Sitting there. Resting comfortably at 85% like that was just where it lived now. K3s itself was lean, maybe 500MB for the kubelet and its friends. But the workloads piled up. PostgreSQL wanted its shared buffers. OpenWebUI had its own footprint. Every container image loaded into memory. Every log buffer, every connection pool, every cache.

The thing about 85% memory usage is that it works fine right up until the moment it doesn’t. One extra query, one log rotation that caches too aggressively, one workload that decides to allocate a buffer, and you’re at 100%. And at 100%, the OOM killer wakes up, picks a process it doesn’t like, and shoots it. Usually the most important one. Always the most important one. It’s practically a law of distributed systems: the OOM killer has perfect aim for the thing you care about most.

K3s is lightweight. Your workloads are not. Java apps, databases, inference engines — they eat RAM for breakfast and then ask for seconds. The K3s binary might be 50MB, but that PostgreSQL instance wants a gigabyte and the inference engine wants four. “Lightweight Kubernetes” means the orchestrator is light, not that it magically makes everything else light too. Budget for your actual workloads, not the control plane overhead.
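Budgeting also means telling the scheduler what each workload actually needs. A minimal sketch of the resources stanza on a container spec; the numbers are illustrative, not a recommendation:

# Per-container resource hints (excerpt from a pod spec)
resources:
  requests:
    memory: "1Gi"      # reserved at scheduling time; the node must have this much free
    cpu: "500m"
  limits:
    memory: "2Gi"      # exceeding this gets the container OOM-killed
    cpu: "1"           # exceeding this gets the container throttled, not killed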

In late 2023, I was blissfully happy with my cluster. The 85% memory number was a datapoint I noted and moved on from. It would come back to haunt me in spectacular fashion during the OOM Crisis of January 2026, but that’s a story for another post. A much longer, much angrier post.


The Low-Power Pivot (July 2025)

By mid-2025, I’d learned something fundamental: not every workload belongs on the same hardware. The heavy stuff — compilation, inference, anything that eats CPU cores for fun — that belonged on beefy machines that could be turned off when not in use. But there’s a category of services that just need to exist, 24/7, quietly, on as little power as possible.

I set up a dedicated low-power box for these always-on lifecycle services. Not a powerhouse, but reliable. Fanless. Sits in a closet. Draws about 35 watts. The kind of machine you forget exists until you check Uptime Kuma and realize it’s been running for 42 days without a hiccup.

Single-node K3s cluster with a stripped-down install:

curl -sfL https://get.k3s.io | sh -s - server \
  --disable traefik \
  --node-name homeserver \
  --write-kubeconfig-mode 644

Why disable Traefik? Because I manage my own ingress with Cloudflare Tunnels for external access. Traefik is fine if you want a built-in solution, but I already had a setup I trusted, and running two ingress controllers leads to routing conflicts that are exactly as fun to debug as they sound.

The services that landed on this box:

  • VS Code Server — Remote IDE access from anywhere. Because sometimes I’m on my laptop, sometimes I’m on my phone (don’t judge me), and the development environment should follow me.
  • Uptime Kuma — Monitoring and alerting. Pings everything on the network every 60 seconds and sends me a notification when something goes down. Which is more often than I’d like to admit.
  • Mosquitto — MQTT broker for IoT sensors. Temperature, humidity, motion — the house talks to this service.
  • Zigbee2MQTT — Home automation bridge. Zigbee devices speak their own protocol, and this translates it into something the rest of the stack can understand.

Total resource usage for all of it: under 5GB RAM out of 16GB available. Memory utilization hovering around 26%. After the anxiety of watching Arcturus-Prime sweat at 85%, this felt like luxury. Room to breathe. Room for the OOM killer to stay asleep.


GitOps with Flux

Here’s where things got interesting. And by interesting, I mean “where I stopped treating the cluster like a pet and started treating it like cattle.”

No manual kubectl apply commands. That path leads to configuration drift — the slow, creeping divergence between what you think is running and what’s actually running. You apply a quick fix at midnight, forget to commit it, and three months later you’re staring at a manifest that doesn’t match reality and you have no idea which version is correct.

Instead, Flux CD watches a Git repository and reconciles the cluster state automatically (see the sketch after this list):

  1. I commit a change to the infrastructure repo.
  2. Flux (running inside the cluster) detects the new commit.
  3. Flux pulls the changes and applies them to the cluster.
  4. The cluster state matches Git. Always.
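Under the hood, that loop is two Flux custom resources: a GitRepository telling Flux where to pull from, and a Kustomization telling it what to apply. A minimal sketch; the repo URL, path, and intervals are placeholders rather than my actual layout:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: homelab
  namespace: flux-system
spec:
  interval: 1m                  # how often Flux polls for new commits
  url: https://github.com/example/homelab-infra
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: apps
  namespace: flux-system
spec:
  interval: 10m                 # re-reconcile even when nothing changed in Git
  sourceRef:
    kind: GitRepository
    name: homelab
  path: ./apps                  # directory of manifests inside the repo
  prune: true                   # delete cluster objects that were removed from Git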

The beautiful part — the part that justifies all the YAML and all the learning curve — is disaster recovery. If the hardware dies, the recovery process is:

  1. Provision a new box.
  2. Install K3s (one command).
  3. Bootstrap Flux (one command; sketched below).
  4. Walk away.
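Step 3 is the flux CLI’s bootstrap command, which installs Flux into the cluster and wires it to the repo. Roughly, for a GitHub-hosted repo, with owner, repository name, and path as placeholders:

flux bootstrap github \
  --owner=<github-user> \
  --repository=<infra-repo> \
  --branch=main \
  --path=./clusters/homeserver \
  --personal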

Flux reads the Git repo, sees the desired state, and rebuilds everything. Every deployment, every service, every config map, every PVC definition. The entire stack reconstructs itself from version-controlled declarations. No runbooks. No “step 47: remember to restart the MQTT broker.” Just Git as the single source of truth.

The declarative dream actually works. I was skeptical. Years of imperative system administration — SSH in, run commands, hope you remember what you did — made me doubt that a purely declarative approach could handle real-world messiness. But it does. Flux has caught drift I didn’t even know existed. A manual change I made during debugging that I forgot to revert? Flux reverted it for me on the next reconciliation cycle. Annoying in the moment. Invaluable in the long run.


What I’d Do Differently

Sixteen months of hindsight is a powerful thing. Not regrets exactly, but course corrections I’d make if I were starting over.

Don’t run K3s on WiFi. I did this for approximately one afternoon before the packet loss made everything flaky. Kubernetes is chatty — the control plane, etcd, pod-to-pod communication, health checks — and WiFi’s latency jitter turns “chatty” into “unreliable.” I moved to Ethernet immediately and the intermittent failures vanished. If you’re setting up a homelab cluster and you’re thinking “WiFi should be fine for now,” it won’t be. Run the cable.

Local-path provisioning ties your pods to specific nodes. The built-in local-path-provisioner is the default storage option in K3s, and it’s simple and it works. But “simple” has a cost: your data lives on a specific node’s filesystem. If that node dies, the data is trapped on the dead node’s disk. Your pod can’t reschedule to another node because the PVC is bound to the original node’s local storage. For a single-node cluster, this is fine. For a multi-node cluster, it’s a ticking time bomb. Start with a network-attached storage solution — NFS, Longhorn, anything that isn’t local-path — if you plan to have more than one node.
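The switch itself is one extra field on the claim. A sketch, assuming Longhorn is installed and registers its default longhorn storage class (swap in whatever class your NFS provisioner creates):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-pvc
spec:
  storageClassName: longhorn    # instead of K3s's default local-path
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi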

400 lines of YAML to replace one docker run command. This is the honest truth. The Deployment manifest, the Service, the PVC, the ConfigMap, the Ingress, the namespace — by the time you’ve fully declared a single application in Kubernetes, you’ve written more YAML than some people write in a year. And every line matters. One wrong indentation and the whole thing fails silently or, worse, deploys something subtly wrong.

Is it worth it? Yes. But not for the reason most Kubernetes evangelists will tell you. It’s not worth it because YAML is pleasant to write (it isn’t). It’s worth it because six months later, when something breaks, you can look at the manifest and understand exactly what’s supposed to be running, how it’s configured, and what changed. Try doing that with a docker run command you typed into a terminal eight months ago. I’ll wait.


Where It Stands Now

The low-power box is the quiet heart of the homelab. The numbers as of the last check:

  • Uptime: 42 days
  • Pods: 14 running
  • Memory: 4.2GB / 16GB used
  • Power draw: ~35W

It sits in a closet, runs silently, and just works. That’s the highest compliment I can give any piece of infrastructure: it’s boring. It doesn’t page me. It doesn’t surprise me. It doesn’t crash at 3 AM. It just runs.

The heavy-duty work — compilation, builds, anything that needs raw compute — moved to the Build Swarm. That’s a separate system entirely, born from the lessons learned during the K3s experiments. K3s taught me that declarative infrastructure is worth the investment. The Build Swarm took that principle and scaled it across multiple physical hosts for distributed Gentoo compilation.

K3s still handles the lifecycle services. The things that need to be always-on, always-available, always-stable. VS Code Server so I can code from anywhere. Uptime Kuma so I know when things break. Mosquitto and Zigbee2MQTT so the smart home keeps being smart.


K3s vs Docker Compose: The Honest Answer

People ask me this. “Why not just use Docker Compose?” Fair question. For a single application on a single host, Docker Compose is faster to set up, easier to understand, and perfectly adequate. I won’t pretend otherwise.

But K3s gives me things Compose doesn’t:

Declarative config that actually means something. Docker Compose files are declarative-ish, but they don’t self-heal. If a container crashes in Compose, it stays crashed until you notice (a restart policy helps, but nothing continuously enforces the rest of the file). In K3s, the pod restarts automatically. The desired state is enforced continuously, not just at startup.

GitOps integration. Flux watches the repo and reconciles. With Compose, you’d need to build that pipeline yourself — file watchers, deployment scripts, state verification. It’s doable, but it’s reinventing a wheel that Kubernetes already has.

Self-healing that actually works. A crashed pod gets restarted. A failed health check triggers a replacement. A node goes down and workloads reschedule (assuming you’ve solved the storage problem, which I’ve harped on enough). This isn’t theoretical — I’ve watched pods recover from OOM kills, network blips, and my own bad configuration without intervention.

Future-proofing. Adding a second node to a Docker Compose setup means rewriting your entire deployment strategy. Adding a second node to K3s means running one command on the new machine. The workloads are already described in a way that spans nodes. The orchestrator handles placement.

The learning curve is real. The YAML is verbose. The networking model requires a mental shift from Docker’s simpler approach. I spent weeks learning things the hard way that a docker-compose.yml would have given me in an afternoon.

The payoff is real too. I don’t SSH into servers anymore. I don’t restart services manually. I don’t worry about configuration drift. I commit to Git and the infrastructure converges to the desired state. That’s worth the weeks of learning. That’s worth the 400 lines of YAML.


Sixteen Months Later

December 2023, I typed “How hard can Kubernetes be?” into the void and the void answered with a missing PVC, a rogue /api suffix, and 85% memory utilization that was quietly counting down to disaster.

July 2025, I installed K3s on a low-power box and it’s been running 14 pods on 35 watts ever since, managed entirely through Git commits and Flux reconciliation.

The gap between those two dates is filled with YAML, debugging, and the slow realization that container orchestration isn’t about the containers. It’s about the orchestration. It’s about declaring what you want and letting the system figure out how to make it happen. It’s about recovering from failures automatically instead of manually. It’s about treating infrastructure as code and meaning it.

K3s on modest hardware, in a homelab, managed with GitOps. It’s not a production cloud. It’s not enterprise-grade. But it’s mine, it works, and the things I learned building it became the foundation for everything that came after.

Would I do it again? In a heartbeat. Probably with more RAM though.