K3s Stability Tuning and Reboot Loop Diagnosis

Date: 2026-01-21
Duration: About an hour
Issue: Continuous reboot loop
Root Cause: K3s pod crashes → kernel panic → auto-reboot → repeat

The Problem

Alpha-Centauri kept rebooting every 1-3 minutes, with no clear user-space error before each reboot.

I would SSH in and start typing a command, and the connection would drop before I finished. The system was back up 30 seconds later. I restarted the investigation. It dropped again.

The Timeline

18:30 - System boot
18:33 - System reboot (3 min uptime)
18:34 - System boot
18:35 - System reboot (1 min uptime)
18:36 - System boot
18:39 - System reboot (3 min uptime)
... continues for 45 minutes ...
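
A timeline like this can be rebuilt after the fact from the journal's boot list, assuming the journal is persistent (otherwise earlier boots are gone). This is a minimal sketch. `count_boots` is a helper name invented for illustration, and it only counts rows that look like boot entries, so it tolerates the header line that newer systemd versions print.

```shell
# Count boot entries in `journalctl --list-boots` output read from stdin.
# A boot row is: an index (0, -1, -2, ...), a 32-hex-digit boot ID, then timestamps.
count_boots() {
  grep -cE '^[[:space:]]*-?[0-9]+[[:space:]]+[0-9a-f]{32}[[:space:]]'
}

# Usage on the affected host (not run here):
#   journalctl --list-boots | count_boots
```

Dozens of boots in a single evening is a strong hint that the machine is looping rather than being restarted by hand.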

The system needed to be stabilized before deeper debugging could continue.

The Hunt

Had to work fast. SSH in, run one command, copy the output before the connection died.

First check: what happened before the last reboot?

journalctl -b -1 -n 50

CNI bridge state changes. Hundreds of them.

cni0: port 1(veth...) entered disabled state
cni0: port 1(veth...) entered forwarding state
cni0: port 1(veth...) entered disabled state

The Kubernetes CNI network was thrashing. Interfaces being created and destroyed faster than I could scroll.
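
Scrolling is a poor way to judge "hundreds". A rough sketch for putting a number on the thrash: count bridge port state changes in the previous boot's kernel log. `count_bridge_flaps` is a made-up helper name, and the pattern is based only on the log lines quoted above.

```shell
# Count bridge port state-change lines read from stdin.
# $1 is the bridge name (defaults to cni0, the bridge seen in the logs above).
count_bridge_flaps() {
  bridge=${1:-cni0}
  grep -cE "${bridge}: port [0-9]+\(.*\) entered (disabled|blocking|forwarding) state"
}

# Usage on the affected host (not run here):
#   journalctl -k -b -1 --no-pager | count_bridge_flaps
```

Dividing the count by the uptime of that boot gives a changes-per-minute figure that can be compared across boots.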

The Pods

Checked K3s (NAMESPACE and AGE columns trimmed from the output):

kubectl get pods --all-namespaces

NAME                    READY   STATUS              RESTARTS
openwebui-xxx           0/1     CrashLoopBackOff    47
quartz-vault-xxx        0/1     CrashLoopBackOff    39

Two pods crash-looping. 47 restarts. 39 restarts. They'd been crashing for hours.
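
When there is only time for one command per SSH session, it helps to filter straight to the offenders. A small sketch, assuming the default column order that `kubectl get pods --all-namespaces --no-headers` prints (namespace, name, ready, status, restarts, age). `high_restart_pods` is a hypothetical helper name.

```shell
# Print namespace/name, status, and restart count for pods whose restart
# count is at or above a threshold. Reads `kubectl get pods` rows from stdin.
high_restart_pods() {
  threshold=${1:-10}
  awk -v min="$threshold" '$5 + 0 >= min { print $1 "/" $2, $4, $5 }'
}

# Usage on the affected host (not run here):
#   kubectl get pods --all-namespaces --no-headers | high_restart_pods 10
```

From there, `kubectl describe pod` and `kubectl logs --previous` on each listed pod are the usual next steps for finding out why the container keeps exiting.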

Every crash:

Pod dies
Network namespace destroyed
CNI bridge interface removed
Pod restarts
Network namespace created
CNI bridge interface added
Pod crashes again
Repeat

Hundreds of network interface state changes per minute.

The Kernel Panic

My working theory: the kernel couldn't handle it. Rapid interface state changes, memory churn from pod restarts, and CNI bridge thrashing on a RAM-constrained host combined until something broke deep in the kernel.

Panic.

But I didn't see the panic. Because of this:

sysctl kernel.panic
# kernel.panic = 10

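I assumed this was Ubuntu's default, but the upstream kernel default is 0 (never reboot), and the kubelet embedded in K3s sets kernel.panic=10 when it starts. K3s itself is the more likely source of this value. Either way, the effect is the same: when the kernel panics, wait 10 seconds, then automatically reboot.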

For production servers with monitoring, this is smart — automatic recovery.

For debugging, this is a problem: the system reboots before you can read the panic message.
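
The value of kernel.panic has three distinct meanings, which the kernel's sysctl documentation describes: zero, positive, and negative. A small sketch that spells them out, with `describe_panic_timeout` as an illustrative helper name:

```shell
# Translate a kernel.panic value into what the kernel does after a panic.
describe_panic_timeout() {
  value=$1
  if [ "$value" -eq 0 ]; then
    echo "halt and wait (no automatic reboot)"
  elif [ "$value" -gt 0 ]; then
    echo "reboot after ${value}s"
  else
    echo "reboot immediately"
  fi
}

# Usage on the affected host (not run here):
#   describe_panic_timeout "$(sysctl -n kernel.panic)"
```

Only the zero case leaves the panic message on the console long enough to read or photograph.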

The Loop

The full sequence:

System boots
K3s starts
Pods start crashing (within 30 seconds)
CNI network thrashes
Kernel panics (1-3 minutes)
Wait 10 seconds
Automatic reboot
Return to step 1

That loop repeated until the automatic reboot behavior was disabled for diagnosis.

The Fix

First, break the reboot loop:

sysctl -w kernel.panic=0
echo 'kernel.panic = 0' >> /etc/sysctl.conf

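Now the system will halt on panic instead of rebooting. Not ideal for production, but essential for debugging. One caveat: the kubelet inside K3s writes kernel.panic=10 again when it starts, so re-check the value with sysctl after K3s is running.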
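
Appending with `>>` adds another copy of the line every time the command is rerun. One hedged alternative is to write the setting idempotently into a drop-in file. The helper name `set_sysctl_conf` and the file name in the usage comment are my own assumptions, not something this host already had.

```shell
# Write "key = value" to a sysctl config file, replacing any earlier line for
# the same key so reruns never leave duplicates.
set_sysctl_conf() {
  conf_file=$1
  key=$2
  value=$3
  # Escape dots so a key like kernel.panic is matched literally.
  pattern=$(printf '%s' "$key" | sed 's/\./\\./g')
  tmp=$(mktemp) || return 1
  if [ -f "$conf_file" ]; then
    # Keep every line except existing assignments of this exact key.
    grep -vE "^[[:space:]]*${pattern}[[:space:]]*=" "$conf_file" > "$tmp" || true
  fi
  printf '%s = %s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$conf_file"
  rm -f "$tmp"
}

# Usage as root (the drop-in file name is an assumption):
#   set_sysctl_conf /etc/sysctl.d/99-panic-debug.conf kernel.panic 0
#   sysctl --system          # reload all sysctl config files
#   sysctl -n kernel.panic   # verify the live value
```

Keeping the override in its own file also makes it easy to remove once debugging is over.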

Second, stop the churn:

systemctl stop k3s

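No K3s, no pod restarts, no CNI thrashing, no kernel panic. Stopping the service does not kill containers that are already running, but without the kubelet nothing restarts the ones that crash.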

The system stayed up. First stable boot in an hour.

Was It My Code?

I was running a custom gateway service. First instinct: I broke something.

Searched the entire codebase:

grep -r "reboot\|shutdown.*-r\|systemctl.*reboot" bin/ lib/ scripts/

Zero matches. My code doesn't reboot anything.

Checked the systemd service:

[Service]
Restart=always

Restart=always restarts the process whenever it exits. It does not reboot the system.

The gateway was a victim, not a perpetrator. It crashed because the kernel underneath it crashed.

The Actual Culprit

Primary cause: K3s pods crash-looping

Contributing factors:

Constrained RAM shared between K3s, monitoring, gateway, and other services
Aggressive pod restart policy
CNI network bridge instability under rapid state changes

Trigger: Something caused the pods to start crashing (OOM? config error? dependency failure?)

Amplifier: kernel.panic=10 turned each crash into a self-sustaining reboot loop
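
The OOM question in the trigger line is one of the cheaper ones to answer, because the kernel logs every OOM kill. A rough sketch, assuming the usual kernel wording ("Out of memory: Killed process ..." for host-wide kills and "Memory cgroup out of memory: Killed process ..." for container limits). `count_oom_kills` is an illustrative name.

```shell
# Count OOM kill lines in kernel log text read from stdin.
# Matches both host-wide and memory-cgroup kills, old and new kernel wording.
count_oom_kills() {
  grep -ciE 'out of memory: kill(ed)? process'
}

# Usage on the affected host (not run here):
#   journalctl -k -b -1 --no-pager | count_oom_kills
```

A nonzero count for the boots inside the loop would move OOM from a guess to the leading suspect. Zero points toward config or dependency failures instead.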

The Lessons

A nonzero kernel.panic timeout can hide the root cause during development. Set kernel.panic=0 on any machine you might need to debug.

K3s on constrained hosts needs resource discipline. The control plane alone wants meaningful headroom. Add pods, and resource limits matter.

Crash-looping pods can take down a host. The CNI network changes cascade into kernel-level instability. Resource limits and proper health checks matter.
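
A first step toward resource discipline is finding the pods that have no memory limit at all. The sketch below assumes kubectl's custom-columns output, which prints `<none>` when a field is missing. The column names and the helper `pods_without_memory_limit` are my own choices.

```shell
# Print namespace/name for pods whose memory-limit column is empty.
# Expects three columns on stdin: namespace, name, memory limit.
pods_without_memory_limit() {
  awk '$3 == "<none>" { print $1 "/" $2 }'
}

# Usage on the affected host (not run here):
#   kubectl get pods --all-namespaces --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,MEM:.spec.containers[*].resources.limits.memory' \
#     | pods_without_memory_limit
```

Anything this lists can consume all the memory on the host, which matters most on a machine that also runs the gateway and monitoring.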

Check system logs before blaming your code. I spent 15 minutes suspecting my gateway when journalctl already showed the CNI thrash that pointed at K3s. The panic message itself never made it into the logs, but the events leading up to it did.

The Prevention

Immediate:

kernel.panic=0 prevents the reboot loop
K3s stopped until pods are fixed

Short-term:

Fix or delete the crash-looping pods
Add resource limits to K3s workloads
Migrate gateway to isolated LXC container

Long-term:

Dedicated K3s node(s) with more RAM
Proper monitoring with reboot alerts
Health checks that prevent infinite crash loops

The kernel panic was a symptom, not the root cause. The useful lesson is that rapid pod churn, constrained resources, and automatic panic reboot behavior can combine into a loop that hides the root cause.