Sleeping at Night: Automated KVM Backups with Bash

Sleep is Important

I have 10 VMs running on the Capella-Outpost workstation. These aren’t toys. Home Assistant controls the lights and thermostat. The build swarm orchestrator coordinates package compilation across the 66-core cluster --- if it goes down, no binary packages get built for any of my Gentoo machines. There’s a DNS resolver, a testing sandbox for Argo OS changes, and a handful of service containers that I’d rather not rebuild from scratch.

If the NVMe drive died today, all of that would be gone. Not “inconvenient” gone. “Spend a full weekend recreating everything from memory and scattered notes” gone.

I know this because it almost happened. Twice.

The first close call was a SMART warning on the boot NVMe. I noticed it in passing while checking something else entirely --- the drive had reallocated 14 sectors in a week. I ordered a replacement that night and spent the next two days doing manual dd copies of everything. No automation. No script. Just me, a terminal, and the quiet dread of watching a progress bar at 2 AM.

The second time was worse. A bad kernel update (my fault, testing a custom config) caused a filesystem panic on the VM storage partition. It recovered after an fsck, but one of the qcow2 images came back corrupted. The Home Assistant VM. The one with two years of automation history, device configs, and integrations. I rebuilt it in about six hours, mostly from documentation I’d written, but six hours I’ll never get back. That was the Saturday I decided to write this script.

The Strategy

Backing up a running KVM (libvirt) VM involves two things (both easy to inspect with virsh, as shown just after this list):

  1. The Definition: The XML configuration (CPU, RAM, network map, device passthrough, boot order --- everything).
  2. The Disk: The .qcow2 image.
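
Both are worth inspecting by hand once before you trust a script with them. A quick look (home-assistant here is just an example domain name):

virsh dumpxml home-assistant | head -n 20    # the definition
virsh domblklist home-assistant --details    # the disk(s) behind it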

Crucially, you cannot just copy the disk while the VM is writing to it. You get corruption. The filesystem inside the VM is in an inconsistent state, caches are dirty, and your “backup” is a snapshot of chaos.

So we have two choices:

  1. Snapshot Mode: Use virsh blockcommit or external snapshots. Efficient, no downtime, but complex and fragile. I’ve seen blockcommit leave orphaned overlay files that eat disk space silently.
  2. The Sledgehammer: Shut down, copy, start up. Simple. Disruptive. Honest.

Since these are homelab VMs and 3 AM is not peak usage, I chose the Sledgehammer. My daughter isn’t adjusting the thermostat at 3 AM. Probably.
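
For the record, the snapshot route looks roughly like this. This is a sketch, not what I run, and it assumes a disk target of vda and the paths shown:

# Point writes at a temporary overlay so the base image stops changing
# (add --quiesce if the guest agent is installed, to flush filesystems first)
virsh snapshot-create-as home-assistant backup-snap --disk-only --atomic \
    --no-metadata --diskspec vda,file=/var/lib/libvirt/images/ha-overlay.qcow2
# Copy the now-quiescent base image
cp --sparse=always /var/lib/libvirt/images/home-assistant.qcow2 /backups/vms/
# Merge the overlay back into the base and switch the VM to it
virsh blockcommit home-assistant vda --active --pivot
rm /var/lib/libvirt/images/ha-overlay.qcow2

Zero downtime, but every step can fail halfway, and a failed pivot is exactly how those orphaned overlay files are born.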

The Script

#!/bin/bash
# /usr/local/bin/vm-backup.sh

BACKUP_ROOT="/backups/vms"
DATE=$(date +%Y%m%d)
TARGET_DIR="$BACKUP_ROOT/$DATE"
LOG="/var/log/vm-backup.log"
FAILED=0

mkdir -p "$TARGET_DIR"

echo "=== Backup started: $(date) ===" >> "$LOG"

# Get list of running VMs (already-stopped VMs won't appear here and won't be backed up)
VMS=$(virsh list --name)

for VM in $VMS; do
    echo "Processing $VM..." | tee -a "$LOG"

    # 1. Dump XML Config
    virsh dumpxml "$VM" > "$TARGET_DIR/$VM.xml"

    # 2. Get Disk Path (first file-backed disk; assumes one disk per VM)
    DISK_PATH=$(virsh domblklist "$VM" --details | awk '$1 == "file" {print $4; exit}')
    DISK_NAME=$(basename "$DISK_PATH")

    # 3. Shutdown
    echo "Stopping $VM..." | tee -a "$LOG"
    virsh shutdown "$VM"

    # Wait for shutdown (timeout 60s)
    TIMEOUT=0
    while virsh list --name | grep -q "^$VM$"; do
        sleep 5
        TIMEOUT=$((TIMEOUT + 5))
        if [ $TIMEOUT -ge 60 ]; then
            echo "Timeout waiting for shutdown. Forcing..." | tee -a "$LOG"
            virsh destroy "$VM"
            break
        fi
    done

    # 4. Copy Disk
    echo "Backing up $DISK_NAME..." | tee -a "$LOG"
    # Use sparse copy to save space!
    if cp --sparse=always "$DISK_PATH" "$TARGET_DIR/$DISK_NAME"; then
        echo "OK: $VM backed up" >> "$LOG"
    else
        echo "FAIL: $VM copy failed!" >> "$LOG"
        FAILED=1
    fi

    # 5. Start
    echo "Starting $VM..." | tee -a "$LOG"
    virsh start "$VM"
done

# Cleanup old backups (keep 7 days; -mindepth 1 so BACKUP_ROOT itself can never match)
find "$BACKUP_ROOT" -mindepth 1 -maxdepth 1 -type d -mtime +7 -exec rm -rf {} \;

echo "=== Backup finished: $(date) ===" >> "$LOG"

if [ $FAILED -ne 0 ]; then
    echo "WARNING: One or more backups failed. Check $LOG" >> "$LOG"
    exit 1
fi
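
Make the script executable and do one supervised run before cron ever touches it. Watching it cycle a real VM end to end is the only test that counts:

chmod +x /usr/local/bin/vm-backup.sh
/usr/local/bin/vm-backup.sh
tail -f /var/log/vm-backup.log    # from a second terminal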

The “Sparse” Trick

The flag cp --sparse=always is the single most important detail in this script.

My Windows VM has a 100GB allocated disk. But Windows itself only uses about 22GB of that. A normal cp reads every byte, including all the zeroed-out empty space, and writes a 100GB file. A sparse copy detects those zero-filled regions and doesn’t write them to disk. The result is a 100GB logical file that only occupies 22GB of physical storage.
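
You can see the effect on any copy by comparing the logical size against the blocks actually allocated (the filename here is hypothetical):

ls -lh /backups/vms/20251019/windows.qcow2    # logical: the full 100G
du -h /backups/vms/20251019/windows.qcow2     # physical: roughly 22G on disk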

Across 10 VMs, this matters. A lot. My VMs have a combined allocated size of around 480GB. Actual used space is closer to 140GB. Without sparse copies, I’d need half a terabyte per daily backup. With sparse copies, the daily set fits in about 140GB. Over a week of retention, that’s the difference between needing 3.3TB and needing under 1TB.

The tradeoff: sparse copies are slightly slower because cp has to inspect each block to decide whether to write it. On NVMe-to-NVMe copies, the overhead is negligible. On spinning rust, you’d notice. But at 3 AM, nobody’s timing it.

The Restore Stories

I’ve restored from these backups twice. Both times, it worked perfectly.

The first restore was self-inflicted. I was testing a network configuration change inside the DNS resolver VM and managed to brick its networking so thoroughly that even the virsh console couldn’t reach it. Rather than spend an hour debugging iptables rules inside a VM, I just nuked it and restored yesterday’s backup:

virsh destroy dns-resolver
cp --sparse=always /backups/vms/20251019/dns-resolver.qcow2 /var/lib/libvirt/images/
virsh define /backups/vms/20251019/dns-resolver.xml
virsh start dns-resolver

Four commands. Two minutes. Back in business.
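
If you want one more sanity check before that final virsh start, domblklist confirms the freshly defined XML points at the disk you just copied into place:

virsh domblklist dns-resolver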

The second restore was the build swarm orchestrator VM after a failed package update left it in a boot loop. I could have fixed it --- probably --- but I had three machines waiting for binary packages and no patience. Restored from the previous night’s backup, re-ran the update properly, and moved on. Total downtime: about 5 minutes.

Both times, the XML definition restore was just as important as the disk. The orchestrator VM has specific CPU pinning, memory hugepages, and a bridged network config that took me a while to tune. Rebuilding that from memory would have been its own adventure.

Monitoring Backup Success

A backup script that fails silently is worse than no backup script. You think you’re protected, and then the day you need it, you find three weeks of empty directories.

Every morning, I eyeball the log:

tail -20 /var/log/vm-backup.log

I’m looking for two things: the “Backup finished” line with a recent timestamp, and the absence of any “FAIL” lines. If I see something wrong, I fix it before it becomes a pattern.
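
The same check compresses into one line if you'd rather grep than read (any output at all is bad news):

grep -E 'FAIL|WARNING' /var/log/vm-backup.log | tail -n 5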

I also do a quick size check every few days:

du -sh /backups/vms/$(date +%Y%m%d)

If today’s backup is dramatically smaller than yesterday’s, something didn’t copy. The numbers should be roughly consistent --- mine hover around 135-145GB depending on what the VMs have been doing.
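
To eyeball the whole retention window at once, du takes the wildcard:

du -sh /backups/vms/*/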

Is this monitoring approach sophisticated? No. But it’s the kind of thing I’ll actually do, which makes it better than the elaborate notification pipeline I’d build once and never maintain.

Automation

Cron. Simple, reliable, works everywhere. On Argo OS (Gentoo with OpenRC), make sure cronie is installed and running:

rc-update add cronie default
rc-service cronie start

Then the cron entry:

# /etc/cron.d/vm-backup
0 3 * * * root /usr/local/bin/vm-backup.sh >> /var/log/vm-backup.log 2>&1

3 AM, every night. The VMs go down, get copied, come back up. Total window is usually about 12-15 minutes for all 10 VMs. Nobody notices.

Result

Every morning, I wake up to a folder full of .qcow2 images and XML definitions. Seven days of history. Sparse copies keeping the storage footprint reasonable. A log file I can check in 10 seconds.

I’ve restored from them twice. It works.

Cost: $0. Time to write: one Saturday. Peace of mind: the kind where you actually stop thinking about drive failures before falling asleep.