The Crash You Don't See Coming
The Setup
March 1, 11:41 PM. Emerge started with 1,556 packages. March 2, 10:02 AM — still running. Some packages compile locally even with a binhost; X11 libraries need careful handling.
I wasn't monitoring it closely. The workstation was on. I was doing other stuff. Not thinking about the emerge at all.
What Actually Happened
The @world update process works like this:
- unmerge old version of package
- merge new version of package
Standard, boring, happens millions of times a day across Linux systems everywhere.
But if you're running KDE while emerge unmerges libX11, here's what happens: the display server depends on that library. Right now. This second. The library is in use.
Emerge doesn't care. It unmerges anyway.
The display server that's already running doesn't die instantly; the pages it has mapped survive the unlink. But the next time anything needs the file from disk (a new client starting, the driver loading a module, a dlopen() call), the load fails, because the .so is gone. My graphics stack hit exactly that. The GPU driver wedged, and the kernel stalled waiting for GPU operations that would never complete.
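You can see the unlink semantics behind this anywhere; here's a sketch that uses a copied binary as a stand-in for a shared library (the deletion behavior is the same):

```shell
# Copy a binary and start it, so the kernel has its pages mapped
# (a stand-in for a display server holding libX11 open).
cp "$(command -v sleep)" /tmp/demo-sleep
/tmp/demo-sleep 3 &
pid=$!
sleep 1

# Delete the file, like emerge unmerging a library.
rm /tmp/demo-sleep

# The running process is fine; its mapped pages survive the unlink.
kill -0 "$pid" && echo "old process still running"

# But anything that needs the file from disk *now* fails.
/tmp/demo-sleep 1 2>/dev/null || echo "new load fails, file is gone"

wait "$pid"
```

That's why the crash isn't instantaneous: everything keeps limping along until the first component that needs to touch the deleted file again.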
Process after process hits D-state: uninterruptible sleep. Waiting for I/O that will never complete.
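If you can still reach a shell at that point, D-state processes are easy to spot; `D` in the ps state column is uninterruptible sleep. A command worth knowing before you need it:

```shell
# List processes in uninterruptible sleep (state D).
# On a healthy system this usually prints nothing; after a failure
# like this one, it fills up with everything that touched the GPU.
ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ { print $2, $3 }'

# With a live console, SysRq can dump the blocked tasks' kernel stacks:
#   echo w > /proc/sysrq-trigger   # then read dmesg
```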
The keyboard doesn't work. The mouse doesn't work. The network is fine but you can't interact with the system.
The screen shows... nothing. Just frozen. Whatever was on display last.
How I Found Out
I didn't find out on March 2. I went to bed and didn't touch the machine again until March 4, when I actually needed it.
The emerge had crashed sometime after 10:02 AM on March 2. The locked-up system just sat there, powered on, for two days.
That's how silent this failure was.
What I Learned (Too Late)
Everyone who uses Gentoo knows the answer: run @world updates from a TTY.
sudo rc-update del display-manager default
sudo reboot
# [login to text console]
sudo emerge @world --keep-going
sudo reboot
That's the formula. Boot to text mode. Run the update without a display server consuming X11 libraries. Reboot when done.
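There's also a habit that makes the risk visible after the fact: check whether running processes still map libraries that have since been deleted from disk. A minimal sketch over /proc (without root it only sees your own processes):

```shell
# Flag processes still mapping shared libraries that were deleted
# from disk, i.e. survivors running on borrowed time after an update.
for maps in /proc/[0-9]*/maps; do
    pid=${maps#/proc/}
    pid=${pid%/maps}
    if grep -q '\.so.*(deleted)' "$maps" 2>/dev/null; then
        echo "PID $pid still maps deleted libraries"
    fi
done
```

If the display server shows up in that list after a @world update, a restart is overdue.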
I know this. I've known this for years. I just... didn't do it.
That's the part that gets you in infrastructure work. You know what the best practice is. You know it's the best practice because you've read it in three different places and seen the consequences. But then you take a shortcut because "it's just a binary install, should be fast" and then 14 days later you're debugging a system that locked up while you were asleep.
The Hidden Failure
What makes this particularly dangerous: a system can lock up in the background and you have no idea.
If I had been watching the screen, I would've seen it freeze instantly. I would've been like, "oh, something went wrong, force reboot."
But I wasn't watching. The system locked up, I went to bed, and time just passed. The disk was a btrfs filesystem with snapper configured, so snapper kept taking snapshots of the locked-up state.
When I finally booted the system 4 days later, the damage was already done. The emerge had crashed. libxcrypt was halfway recompiled. /etc/shadow was corrupted. The system was in a state I couldn't even diagnose until I booted from OpenSUSE on the other partition.
If I had been actively using the system, I would've hit the crash immediately and investigated right away. The 4-day delay made everything worse.
The Next 14 Days
This crash is the reason I spent the next two weeks writing recovery scripts, debugging libxcrypt, rebuilding packages from a chroot, and learning more about PAM authentication than I ever wanted to.
March 4: discovered the emerge had (mostly) completed but the system was unbootable. March 6: found the root cause (libxcrypt lost SHA512 support). March 9: deployed a 28KB recovery script. March 10: fixed the shadow corruption, got login working.
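For the record, the rescue-side chroot dance looks roughly like this. The device and mount paths are examples from my layout, not universal, so treat it as a template; the wrapper defaults to printing each command instead of running it:

```shell
#!/bin/sh
# Rebuild packages in a broken Gentoo root from a rescue system.
# DRY_RUN=1 (the default) prints each command instead of executing it.
# /dev/nvme0n1p3 and /mnt/gentoo are placeholders for my layout.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi; }

run mount /dev/nvme0n1p3 /mnt/gentoo            # the broken root
run mount --bind /dev  /mnt/gentoo/dev
run mount --bind /proc /mnt/gentoo/proc
run mount --bind /sys  /mnt/gentoo/sys
run cp /etc/resolv.conf /mnt/gentoo/etc/        # DNS inside the chroot
run chroot /mnt/gentoo emerge --oneshot sys-libs/libxcrypt
```

Review the printed commands, then rerun with DRY_RUN=0 once the paths match your system.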
All of that was downstream of the decision to run the update while KDE was running.
What It Feels Like
When the system wedges in D-state, you don't get an error message. You don't get a kernel panic that reboots. You get silence. Each stuck process is sleeping uninterruptibly inside the kernel, waiting on an operation that will never complete, and nothing can kill it. Not even SIGKILL gets through.
From the user's perspective: the computer just... stopped. No warning. No log. Just stopped.
That's why best practices in infrastructure usually have a boring name like "run updates from a text console." Because they're answering a question that sounds ridiculous until you hit it: "what if the thing you're updating is actively being used by the system right now?"
The answer is: you get a silent crash that takes 14 days to debug.
Today's Lesson
You know the best practice for a reason. The reason is usually a story like this one.
Next time, I'm booting to TTY first.