Argo OS Development Sprint: Btrfs Snapshots and Binary Packages

The Weekend Sprint

November 16-17 was a pivotal weekend for Argo OS (my custom Gentoo distribution). Moved from a “barely booting” prototype to a snapshot-protected, version-controlled system with a clear path to binary distribution.

Also, I broke SDDM. Badly.

The etc-update Massacre

Saturday afternoon. Things were going well. Too well. I had a backlog of configuration file updates from a big @world rebuild: probably forty or fifty ._cfg0000_ files waiting in /etc. The right move is to go through them one by one, merging the changes you want and discarding the rest. The move I actually made was:

etc-update --automode -3

For anyone who doesn’t speak Gentoo: that flag means “automatically merge all updates, overwriting your existing configs.” Every. Single. One. No diff. No confirmation. Just scorched earth.

I watched the terminal scroll. /etc/pam.d/system-login. /etc/pam.d/sddm. /etc/conf.d/display-manager. /etc/X11/xorg.conf.d/. My hand-tuned Nvidia config. My PAM stack. My display manager settings. Gone. Replaced with upstream defaults that had no idea about my hardware.

The screen flickered. Then it went black. Then it flickered again. SDDM was trying to start, crashing, respawning, crashing, respawning. No TTY. No cursor. Just an epileptic strobe of Nvidia trying to initialize with settings that no longer existed.

The only way out: Alt+SysRq+R-E-I-S-U-B. The Magic SysRq key sequence. Reclaim keyboard, terminate everything, kill everything harder, sync disks, remount read-only, reboot. The nuclear option. The fact that I know this sequence by heart says something about how many times I’ve been here.

The PAM Stack Autopsy

Booted from a USB stick. Mounted the root partition. Started reading through /etc/pam.d/ to see what got blown away.

The smoking gun was /etc/pam.d/sddm. The upstream default was missing pam_loginuid.so entirely from the session block. Without that module, SDDM couldn’t properly track login sessions. It would try to authenticate, technically succeed, then fail to set up a session and crash. Over and over.

But here’s the thing about PAM debugging — it’s never just one file. PAM configs include each other. sddm includes system-login. system-login includes system-auth. Change one and you cascade through the whole stack.
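A minimal sketch of how that include chain could be traced before touching anything. It assumes the files use the `type include target` or `type substack target` form (other mechanisms are ignored), and `pam_include_tree` is a hypothetical helper name, not something PAM ships:

```shell
#!/bin/bash
# Print the chain of PAM files pulled in by a service file, indented by depth.
# Only "include" and "substack" control fields are followed (an assumption).
pam_include_tree() {
    local file="$1" dir="${2:-/etc/pam.d}" depth="${3:-0}"
    local target
    printf '%*s%s\n' $((depth * 2)) '' "$file"
    [ -r "$dir/$file" ] || return 0
    # Guard against include loops in a damaged config.
    [ "$depth" -lt 10 ] || return 0
    awk '$1 !~ /^#/ && ($2 == "include" || $2 == "substack") { print $3 }' "$dir/$file" \
        | sort -u \
        | while IFS= read -r target; do
              pam_include_tree "$target" "$dir" $((depth + 1))
          done
}

# Usage: pam_include_tree sddm
```

Running it against sddm shows every file that has to be checked after a reset, which is exactly the list I kept discovering one crash at a time.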

My first attempt at fixing it was wrong. I added pam_loginuid.so back to the sddm config but missed that system-login had also been reset. So now SDDM could authenticate, but elogind wasn’t getting session notifications. Plasma would launch, render one frame, and then freeze because it couldn’t talk to the session manager.

Second attempt. Rebuilt all three files. Still broken. This time the Nvidia driver wasn’t loading because my custom xorg.conf.d snippet that forced nvidia_drm.modeset=1 was gone. X was trying to use nouveau, which I’d blacklisted in the kernel, so X had no GPU driver at all.

Third attempt. Fixed the Xorg config, rebuilt PAM, and then realized elogind wasn’t even in the boot runlevel anymore. The etc-update had reset /etc/conf.d/display-manager, which cascaded into the init scripts.

rc-update add elogind boot
rc-update add display-manager default

Fourth attempt. Everything came up. SDDM loaded. Plasma launched. I stared at my desktop for a solid ten seconds waiting for it to crash again. It didn’t.

Total debugging time: about three hours. Total files that etc-update -3 destroyed that mattered: at least seven. Files it updated that I actually wanted: maybe two.

Never use etc-update -3. Go through them manually. Yes, it takes longer. No, you won’t lose three hours rebuilding your display stack from a USB stick.
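For the manual route, it helps to see the backlog before merging anything. A read-only sketch, assuming pending updates follow Portage's `._cfgNNNN_<name>` naming next to the live file; both function names are hypothetical:

```shell
#!/bin/bash
# List pending Portage config updates under a directory (default: /etc).
list_pending_configs() {
    local root="${1:-/etc}"
    find "$root" -type f -name '._cfg[0-9][0-9][0-9][0-9]_*' | sort
}

# Show what each pending update would change relative to the live file.
# Nothing is merged or deleted.
preview_pending_configs() {
    local root="${1:-/etc}" pending live
    list_pending_configs "$root" | while IFS= read -r pending; do
        # Strip the ._cfgNNNN_ prefix to recover the live file's name.
        live="$(dirname "$pending")/$(basename "$pending" | sed 's/^\._cfg[0-9]\{4\}_//')"
        echo "=== $live"
        # diff exits non-zero when files differ; that is expected here.
        diff -u "$live" "$pending" || true
    done
}

# Usage: preview_pending_configs /etc | less
```

Anything under /etc/pam.d or /etc/X11 in that output is a file to merge by hand.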

But the real fix was what I did next — making sure I had a fallback TTY so I could at least log in when the GUI died:

ln -s /etc/init.d/agetty /etc/init.d/agetty.tty1
rc-update add agetty.tty1 default

Simple. Two commands. Would have saved me from booting off a USB stick entirely. Past Me is a liability.

The openSUSE-style @ Subvolumes

With the display stack rebuilt and my heart rate back to normal, I turned to what I’d actually planned to work on this weekend: Btrfs subvolume restructuring.

I love how openSUSE handles Btrfs. It uses @ for root and @home for user data. This means you can snapshot the entire operating system without touching personal files, and roll back a bad system update without losing your documents. Clean separation.

My existing Gentoo install was just sitting on a flat Btrfs filesystem. No subvolumes. Snapshotting meant snapshotting everything, and rolling back meant rolling back everything. Useless for what I wanted.

So I decided to migrate to the @ structure. Live. On the running system.

This is the kind of thing that sounds insane until you remember that Btrfs subvolumes are basically just directory pointers, not physical partitions. In theory, you can rearrange them while the system is running. In theory.

Step one. Mount the top-level Btrfs volume (subvolid=5) somewhere accessible:

mkdir -p /mnt/btrfs-root
mount -o subvolid=5 /dev/nvme0n1p2 /mnt/btrfs-root

Step two. Create the @ and @home subvolumes:

btrfs subvolume create /mnt/btrfs-root/@
btrfs subvolume create /mnt/btrfs-root/@home

Step three. This is where it got intense. I needed to move the entire root filesystem — /bin, /etc, /usr, /var, everything — into the @ subvolume. On a running system. The trick is using cp -ax (which stays on the same filesystem and preserves attributes) rather than mv (which would break everything mid-operation):

cp -ax /mnt/btrfs-root/bin /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/etc /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/lib /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/lib64 /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/opt /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/root /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/sbin /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/srv /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/usr /mnt/btrfs-root/@/
cp -ax /mnt/btrfs-root/var /mnt/btrfs-root/@/

Each line felt like heart surgery. If the system panicked mid-copy, I’d have a half-migrated root that couldn’t boot. My USB stick was within arm’s reach the entire time.
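The same copy can be written as a loop that is easy to rehearse on a scratch directory first. This is a sketch, not the exact commands I ran: `populate_root_subvolume` is a hypothetical name, and the list of empty mount points it creates (including /boot as a separate partition) is an assumption about what the new root needs.

```shell
#!/bin/bash
# Copy the listed top-level directories into the new root subvolume, then
# create the empty mount points a booting system expects to find.
populate_root_subvolume() {
    local src="$1" dst="$2" dir
    for dir in bin etc lib lib64 opt root sbin srv usr var; do
        # Skip anything this install does not have.
        [ -e "$src/$dir" ] || continue
        # -a preserves attributes and symlinks, -x stays on one filesystem.
        cp -ax "$src/$dir" "$dst/" || return 1
    done
    # Populated at boot or via fstab; they only need to exist (assumed list).
    for dir in dev proc sys run tmp mnt home boot; do
        mkdir -p "$dst/$dir"
    done
    chmod 1777 "$dst/tmp"
}

# Usage: populate_root_subvolume /mnt/btrfs-root /mnt/btrfs-root/@
```

Stopping at the first failed copy matters here: a partial @ is still harmless as long as the old flat root is untouched.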

Step four. Copy /home into @home. The trailing /. also picks up dotfiles that a * glob would skip:

cp -ax /mnt/btrfs-root/home/. /mnt/btrfs-root/@home/

Step five. Update /etc/fstab inside the new @ subvolume to mount things correctly:

# Root
UUID=<uuid>  /      btrfs  subvol=@,compress=zstd:1,noatime  0 0
# Home
UUID=<uuid>  /home  btrfs  subvol=@home,compress=zstd:1,noatime  0 0

Step six. Update GRUB. This was the scariest part because if GRUB can’t find the kernel, you just… don’t boot. Updated GRUB_CMDLINE_LINUX to include rootflags=subvol=@ and regenerated the config:

grub-mkconfig -o /boot/grub/grub.cfg
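Since a wrong grub.cfg means no boot, a quick check of the generated file is cheap insurance. A sketch, assuming the flag appears as its own word on every `linux` line; `grub_cfg_has_rootflags` is a hypothetical helper:

```shell
#!/bin/bash
# Succeed only if every "linux" line in a generated grub.cfg carries the
# expected rootflags word. Path and flag are parameters with assumed defaults.
grub_cfg_has_rootflags() {
    local cfg="${1:-/boot/grub/grub.cfg}" flag="${2:-rootflags=subvol=@}"
    awk -v flag="$flag" '
        $1 ~ /^linux(16|efi)?$/ {
            total++
            found = 0
            # Exact word match, so subvol=@home does not count as subvol=@.
            for (i = 2; i <= NF; i++) if ($i == flag) found = 1
            if (!found) missing++
        }
        END { exit !(total > 0 && missing == 0) }
    ' "$cfg"
}

# Usage: grub_cfg_has_rootflags && echo "safe to reboot"
```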

Step seven. Deep breath. Reboot.

It worked. First try. I don’t know what I did to deserve that, but I’m not questioning it.

Snapper: The Time Machine

With the @ structure in place, the whole point of this exercise came into focus: Snapper.

Snapper is openSUSE’s snapshot management tool. It can take Btrfs snapshots on a schedule, before and after package operations, and let you roll back with a single command. It’s the safety net that makes running a bleeding-edge Gentoo distribution something other than reckless.

There’s one problem. Snapper expects systemd. Argo OS runs OpenRC.

This meant the systemd timers that normally trigger hourly snapshots don’t exist. The snapper-timeline.timer and snapper-cleanup.timer units? Gone. The DBus integration for automatic pre/post snapshots during package management? Doesn’t work the same way.

So I had to wire it up manually.

First, the cron-based timeline. Created /etc/cron.hourly/snapper-timeline:

#!/bin/bash
snapper --config root create --type single --cleanup-algorithm timeline --description "hourly"

Then the cleanup job in /etc/cron.daily/snapper-cleanup:

#!/bin/bash
snapper --config root cleanup timeline
snapper --config root cleanup number

The timeline algorithm keeps a configurable number of hourly, daily, weekly, and monthly snapshots. After some experimentation, I settled on: 8 hourly, 7 daily, 4 weekly, 3 monthly. Enough history to recover from anything, not so much that it eats the disk.
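Those retention numbers live in the snapper config for the root subvolume. A sketch of the relevant excerpt: the four limits are the values above, while the yearly limit of 0 is an assumption, since I only settled the other four.

```shell
# /etc/snapper/configs/root (excerpt)
# Retention used by "snapper cleanup timeline".
TIMELINE_LIMIT_HOURLY="8"
TIMELINE_LIMIT_DAILY="7"
TIMELINE_LIMIT_WEEKLY="4"
TIMELINE_LIMIT_MONTHLY="3"
# Assumption: no yearly snapshots are kept.
TIMELINE_LIMIT_YEARLY="0"
```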

For pre/post emerge snapshots, I wrote a wrapper script that gets called by Portage hooks. Before any emerge operation, it creates a “pre” snapshot. After, it creates a “post” snapshot linked to the pre. If the emerge breaks something, I can diff the two snapshots to see exactly what changed:

snapper diff 42..43

Or just roll back entirely:

snapper rollback 42
reboot

That’s it. The entire system reverts to the state before the bad package install. If I’d had this before the etc-update -3 disaster, it would have been a one-minute fix instead of a three-hour one.
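The pre/post pairing can be sketched as a small wrapper around any command. This is an illustration rather than my exact hook script: `with_snapshots` and the `SNAPPER` override are hypothetical names, and it assumes a snapper config called root.

```shell
#!/bin/bash
# Wrap a command in a snapper pre/post snapshot pair.
# SNAPPER can be overridden for a rehearsal; it defaults to the real CLI.
SNAPPER="${SNAPPER:-snapper}"

with_snapshots() {
    local desc="$1"; shift
    local pre status
    # --print-number makes snapper echo the number of the new snapshot.
    pre=$("$SNAPPER" --config root create --type pre --print-number \
          --cleanup-algorithm number --description "$desc") || return 1
    "$@"
    status=$?
    # Link the post snapshot to its pre snapshot, even if the command failed.
    "$SNAPPER" --config root create --type post --pre-number "$pre" \
        --cleanup-algorithm number --description "$desc" || return 1
    return "$status"
}

# Usage: with_snapshots "emerge world" emerge --update --deep @world
```

Taking the post snapshot even on failure is deliberate: a half-finished emerge is exactly the case where the diff between the pair is most useful.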

The Binary Package Infrastructure

Compiling packages locally is Gentoo’s whole thing, but it’s also Gentoo’s whole problem. A full @world update on my workstation can take hours. Firefox alone is over an hour. Chromium? Don’t even ask.

The solution: a dedicated build VM on Izar-Orchestrator that compiles everything and exports binary packages. My workstation just downloads the pre-built binaries.

Set up a VM with matching USE flags, CFLAGS, and architecture settings. The important part is that the build VM’s /etc/portage/make.conf includes:

# Build and export binary packages
FEATURES="buildpkg"
PKGDIR="/var/cache/binpkgs"

Every time the build VM compiles a package, it also creates a .gpkg.tar binary package. These get synced to a shared directory on the NAS that the workstation can access.

On the workstation side:

# /etc/portage/make.conf
PORTAGE_BINHOST="file:///mnt/binpkgs"
FEATURES="getbinpkg"

Now when I run an update:

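emerge --update --deep --usepkgonly @world

The workstation checks the binhost first. If a pre-compiled package exists with matching USE flags and dependencies, it fetches and installs it directly. No compilation. With --usepkgonly (which implies --usepkg), a package with no matching binary is an error rather than a silent fallback to building from source. What used to take hours takes minutes. Sometimes seconds.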

The build VM compiles overnight on a schedule. By morning, everything’s ready. My workstation stays responsive, my fans stay quiet, and I can actually use my computer while it updates.
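The overnight job on the build VM could look roughly like this. It is a sketch under assumptions: `BINPKG_SHARE` is a hypothetical mount point for the NAS share, `nightly_build` and `run` are made-up names, and the `DRY_RUN` switch exists only so the sequence can be printed without executing anything.

```shell
#!/bin/bash
# Nightly build-and-publish job for the build VM (a sketch).
PKGDIR="${PKGDIR:-/var/cache/binpkgs}"
# Hypothetical NAS mount point; adjust to the real share.
BINPKG_SHARE="${BINPKG_SHARE:-/mnt/nas/binpkgs}"

# Print each command; execute it only when DRY_RUN is not 1.
run() {
    echo "+ $*"
    [ "${DRY_RUN:-0}" = "1" ] || "$@"
}

nightly_build() {
    run emerge --sync || return 1
    # --buildpkg is redundant with FEATURES="buildpkg" but states the intent.
    run emerge --update --deep --newuse --buildpkg @world || return 1
    # Trailing slashes sync directory contents, including the Packages index.
    run rsync -a --delete "$PKGDIR/" "$BINPKG_SHARE/" || return 1
}

# Usage: DRY_RUN=1 nightly_build   # print the plan without running it
```

Called from a cron entry in the small hours, the binaries are on the share by morning.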

The Result

Ended the weekend with v0.2.1-alpha. Felt real for the first time.

Stable booting with fallback TTY. Even if the GUI dies — and it will die again, this is Gentoo — I can log in and fix things without a USB stick.

Hourly Btrfs snapshots. The last eight hours at hourly granularity, with daily, weekly, and monthly snapshots behind them. Plus snapshots before and after every single package operation. The etc-update -3 disaster can never happen again. Or rather, it can happen, but the fix is snapper rollback instead of three hours with a USB stick.

Binary package distribution working. Update times on the workstation went from hours to minutes, sometimes seconds. The build VM handles the heavy lifting overnight.

Snapper rollback tested and confirmed. I intentionally broke the system three different ways and rolled back each time. Takes about forty-five seconds including reboot.

If I break it, I can travel back in time. That’s the whole point of Argo OS — a Gentoo system that’s actually safe to experiment on.