K3s Stability Tuning and Reboot Loop Diagnosis
Date: 2026-01-21
Duration: About an hour
Issue: Continuous reboot loop
Root Cause: K3s pod crashes → kernel panic → auto-reboot → repeat
The Problem
Alpha-Centauri kept rebooting every 1-3 minutes, with no clear user-space error before each reboot.
SSH in, run a command, and before I could finish typing, the connection dropped. System was back up 30 seconds later. Started the investigation again. Dropped again.
The Timeline
18:30 - System boot
18:33 - System reboot (3 min uptime)
18:34 - System boot
18:35 - System reboot (1 min uptime)
18:36 - System boot
18:39 - System reboot (3 min uptime)
... continues for 45 minutes ...
The system needed to be stabilized before deeper debugging could continue.
The Hunt
Had to work fast. SSH in, run one command, copy the output before the connection died.
First check: what happened before the last reboot?
journalctl -b -1 -n 50
CNI bridge state changes. Hundreds of them.
cni0: port 1(veth...) entered disabled state
cni0: port 1(veth...) entered forwarding state
cni0: port 1(veth...) entered disabled state
The Kubernetes CNI network was thrashing. Interfaces being created and destroyed faster than I could scroll.
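To put a number on the thrashing, the transitions can be counted per minute. This is a minimal sketch, assuming journalctl's short-iso timestamp format, where the first 16 characters are the date plus hour and minute. The count_cni_flaps helper name is made up for illustration.

```shell
# Count cni0 port state transitions per minute.
# Reads journal lines on stdin and expects `-o short-iso` timestamps,
# e.g. 2026-01-21T18:32:10+0000, so cut -c1-16 keeps YYYY-MM-DDTHH:MM.
count_cni_flaps() {
    grep -E 'cni0: port [0-9]+\(.*\) entered [a-z]+ state' \
        | cut -c1-16 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'   # print "minute count"
}

# Usage on the affected host (kernel messages from the previous boot):
#   journalctl -k -b -1 -o short-iso | count_cni_flaps
```

Each output line is a minute followed by the number of transitions logged in that minute, which makes it easy to see when the churn started.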
The Pods
Checked K3s:
kubectl get pods --all-namespaces
NAME               READY   STATUS             RESTARTS
openwebui-xxx      0/1     CrashLoopBackOff   47
quartz-vault-xxx   0/1     CrashLoopBackOff   39
Two pods crash-looping. 47 restarts. 39 restarts. They'd been crashing for hours.
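On a busier cluster the crash-loopers are easy to miss in the full listing. Here is a rough filter, assuming the six-column layout (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE) that kubectl prints with --all-namespaces and --no-headers. The list_crashloops name and its threshold argument are illustrative, not part of kubectl.

```shell
# Print "namespace/name restarts" for pods in CrashLoopBackOff
# with at least MIN restarts (default 5).
# Expects `kubectl get pods --all-namespaces --no-headers` on stdin.
list_crashloops() {
    min="${1:-5}"
    # $4 is STATUS, $5 is the restart count; "$5 + 0" forces a numeric compare.
    awk -v min="$min" \
        '$4 == "CrashLoopBackOff" && $5 + 0 >= min { print $1 "/" $2, $5 }'
}

# Usage:
#   kubectl get pods --all-namespaces --no-headers | list_crashloops 10
```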
Every crash:
- Pod dies
- Network namespace destroyed
- CNI bridge interface removed
- Pod restarts
- Network namespace created
- CNI bridge interface added
- Pod crashes again
- Repeat
Hundreds of network interface state changes per minute.
The Kernel Panic
The network stack couldn't handle it. All those rapid interface state changes, the memory churn from pod restarts, the CNI bridge thrashing — something broke deep in the kernel.
Panic.
But I didn't see the panic. Because of this:
sysctl kernel.panic
# kernel.panic = 10
That was the setting on this Ubuntu install. When the kernel panics, it waits 10 seconds, then reboots automatically.
For production servers with monitoring, this is smart: automatic recovery.
For debugging, this is a problem: the system reboots before you can read the panic message.
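For reference, the kernel documentation describes three behaviors for this setting: zero waits forever, a positive value reboots after that many seconds, and a negative value reboots immediately. Here is a small helper that spells this out. The describe_panic_timeout name is hypothetical.

```shell
# Describe what a given kernel.panic value means.
describe_panic_timeout() {
    case "$1" in
        0)  echo "halt and wait (panic message stays on the console)" ;;
        -*) echo "reboot immediately" ;;
        *)  echo "reboot after $1 seconds" ;;
    esac
}

# Usage:
#   describe_panic_timeout "$(cat /proc/sys/kernel/panic)"
```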
The Loop
The full sequence:
1. System boots
2. K3s starts
3. Pods start crashing (within 30 seconds)
4. CNI network thrashes
5. Kernel panics (1-3 minutes)
6. Wait 10 seconds
7. Automatic reboot
8. Return to step 1
That loop repeated until the automatic reboot behavior was disabled for diagnosis.
The Fix
First, break the reboot loop:
sysctl -w kernel.panic=0
echo 'kernel.panic = 0' >> /etc/sysctl.conf
Now the system will halt on panic instead of rebooting. Not ideal for production, but essential for debugging.
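Appending to /etc/sysctl.conf works, but running the echo twice leaves duplicate lines. One alternative is a dedicated drop-in file that can be overwritten or deleted later. This sketch assumes the host reads drop-ins from /etc/sysctl.d, and the file name 99-panic-debug.conf is my own choice, not a convention.

```shell
# Write a sysctl drop-in that keeps kernel.panic at 0 across reboots.
# $1 = target directory (defaults to /etc/sysctl.d).
# The file name 99-panic-debug.conf is hypothetical.
write_panic_dropin() {
    dir="${1:-/etc/sysctl.d}"
    # ">" overwrites, so repeated runs never duplicate the line.
    printf 'kernel.panic = 0\n' > "$dir/99-panic-debug.conf"
}

# Usage (as root), then reload all sysctl configuration files:
#   write_panic_dropin && sysctl --system
```

Deleting the file and reloading restores whatever value the rest of the configuration sets.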
Second, stop the churn:
systemctl stop k3s
No K3s, no pod crashes, no CNI thrashing, no kernel panic.
The system stayed up. First stable boot in an hour.
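Two caveats apply here. Stopping the service only lasts until the next boot, and on installs done with the upstream K3s script, containers can keep running after the service stops. This is a cautious sketch, assuming the k3s-killall.sh helper sits at its default install path. The quiesce_k3s wrapper is hypothetical, and it only prints the commands unless RUN=1 is set.

```shell
# Print (or, with RUN=1, execute) the commands that keep K3s down
# across reboots and stop any containers it left behind.
# Assumes /usr/local/bin/k3s-killall.sh exists (upstream install script).
quiesce_k3s() {
    for cmd in \
        "systemctl disable --now k3s" \
        "/usr/local/bin/k3s-killall.sh"
    do
        if [ "${RUN:-0}" = "1" ]; then
            $cmd || return 1     # intentionally unquoted: split into words
        else
            echo "$cmd"          # dry run: show what would be executed
        fi
    done
}

# Usage:
#   quiesce_k3s          # preview
#   RUN=1 quiesce_k3s    # execute as root
```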
Was It My Code?
I was running a custom gateway service. First instinct: I broke something.
Searched the entire codebase:
grep -r "reboot\|shutdown.*-r\|systemctl.*reboot" bin/ lib/ scripts/
Zero matches. My code doesn't reboot anything.
Checked the systemd service:
[Service]
Restart=always
That restarts the process whenever it exits. It does not reboot the system.
The gateway was a victim, not a perpetrator. It crashed because the kernel underneath it crashed.
The Actual Culprit
Primary cause: K3s pods crash-looping
Contributing factors:
- Constrained RAM shared between K3s, monitoring, the gateway, and other services
- Aggressive pod restart policy
- CNI network bridge instability under rapid state changes
Trigger: Something caused the pods to start crashing (OOM? config error? dependency failure?)
Amplifier: kernel.panic=10 turned crashes into an unbreakable loop
The Lessons
An automatic panic reboot can hide the root cause during development. Set kernel.panic=0 on any machine you might need to debug.
K3s on constrained hosts needs resource discipline. The control plane alone wants meaningful headroom. Add pods, and resource limits matter.
Crash-looping pods can take down a host. The CNI network changes cascade into kernel-level instability. Resource limits and proper health checks matter.
Check system logs before blaming your code. I spent 15 minutes suspecting my gateway before checking journalctl. The CNI thrashing was right there in the logs.
The Prevention
Immediate:
- kernel.panic=0 prevents the reboot loop
- K3s stopped until pods are fixed
Short-term:
- Fix or delete the crash-looping pods
- Add resource limits to K3s workloads
- Migrate gateway to isolated LXC container
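The first two short-term items can be sketched with stock kubectl subcommands. Scaling a workload to zero stops the churn without deleting anything, and a memory limit caps what it can consume once it is back. The deployment names, namespace, and 512Mi value in the usage lines are placeholders, since the real workload definitions are not shown here. The two wrapper functions are hypothetical.

```shell
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.

# Scale a deployment to zero replicas so it stops crash-looping.
# $1 = namespace, $2 = deployment name
park_deployment() {
    "${KUBECTL:-kubectl}" scale deployment "$2" --namespace="$1" --replicas=0
}

# Set a memory limit on every container in a deployment.
# $1 = namespace, $2 = deployment name, $3 = memory limit (e.g. 512Mi)
limit_deployment() {
    "${KUBECTL:-kubectl}" set resources deployment "$2" --namespace="$1" \
        --limits="memory=$3"
}

# Usage (names and values are placeholders):
#   park_deployment default openwebui
#   limit_deployment default openwebui 512Mi
```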
Long-term:
- Dedicated K3s node(s) with more RAM
- Proper monitoring with reboot alerts
- Health checks that prevent infinite crash loops
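A reboot alert does not need a full monitoring stack to start with. One possible approach is to record each boot time in a file and flag the host when too many boots land inside a short window. The function names, the log path, and the thresholds in the usage lines are all assumptions for illustration.

```shell
# Append the current time (epoch seconds) to a boot log. Run once per boot.
# $1 = log file
record_boot() {
    date +%s >> "$1"
}

# Count boots recorded within the last WINDOW seconds before NOW.
# $1 = log file, $2 = window in seconds, $3 = now (defaults to current time)
recent_boot_count() {
    file="$1"; window="$2"; now="${3:-$(date +%s)}"
    awk -v now="$now" -v window="$window" \
        '$1 + 0 > now - window { n++ } END { print n + 0 }' "$file"
}

# Succeed when at most MAX boots fall inside the window, fail otherwise.
# $1 = log file, $2 = window in seconds, $3 = max boots, $4 = now (optional)
check_reboot_loop() {
    count=$(recent_boot_count "$1" "$2" "${4:-}")
    [ "$count" -le "$3" ]
}

# Usage, e.g. from a boot-time unit (the path is hypothetical):
#   record_boot /var/lib/boot-times.log
#   check_reboot_loop /var/lib/boot-times.log 600 3 || echo "reboot loop suspected"
```

With the timeline above, three boots in ten minutes would already trip a threshold like this one.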
The kernel wasn't the first cause. The useful lesson was that rapid pod churn, constrained resources, and automatic panic reboot behavior can combine into a loop that hides the actual fix.