Infrastructure as Code for the Homelab
It’s a Tuesday night. Something is wrong with Promtail on the Andromeda site — logs stopped flowing to Loki about six hours ago, and I didn’t notice until I went looking for a specific error message that should have been there. So I SSH into Tarn-Host. Check the Promtail config. It looks fine. Restart the service. Still broken. Check the systemd journal. The Loki endpoint URL is wrong — I updated the Loki server address last week on the local network and forgot to update the remote site.
So I fix it. Update the config on Tarn-Host. Then I remember: drone-Meridian also runs Promtail. SSH into that. Same fix. Then Meridian-Mako-Silo runs a Promtail container. Different config format. Fix that too. Then I think — did I update the Milky Way nodes? SSH into Izar-Orchestrator. Check. It’s fine there, I already fixed it. But drone-Tau-Beta has the old config. Fix. Drone-Izar? Also old. Fix. Altair-Link? Old. Fix.
Eight servers. One config change. Two hours of SSHing around, checking each one, fixing each one, hoping I didn’t miss any. I missed one. Found it the next day.
That was the last time I did it manually.
Ansible for Configuration
One server acts as the Ansible control node — Canopus-Outpost, my desktop workstation. From here, I can target every machine in both galaxies.
The inventory splits across two sites. Milky Way is the local network at 10.42.0.x. Andromeda is Dad’s house, 40 miles away, at 192.168.20.x, connected through Tailscale. Different subnets, different hardware, different levels of “someone might have changed something without telling me.”
The Full Inventory
# inventory/hosts.ini
# ═══ MILKY WAY (Local - 10.42.0.x) ═══
[milkyway_hypervisors]
izar-orchestrator ansible_host=10.42.0.201 ansible_user=commander
altair-link ansible_host=10.42.0.199 ansible_user=commander
[milkyway_drones]
drone-izar ansible_host=10.42.0.203 ansible_user=commander cores=16
drone-tau-beta ansible_host=10.42.0.194 ansible_user=commander cores=8
[milkyway_storage]
rigel-silo ansible_host=10.42.0.10 ansible_user=commander
[milkyway_workstations]
canopus-outpost ansible_host=10.42.0.100 ansible_user=commander
# ═══ ANDROMEDA (Remote - 192.168.20.x) ═══
[andromeda_hypervisors]
tarn-host ansible_host=192.168.20.100 ansible_user=commander
[andromeda_drones]
drone-tarn ansible_host=192.168.20.196 ansible_user=commander cores=14
drone-meridian ansible_host=192.168.20.50 ansible_user=commander cores=24
[andromeda_storage]
meridian-mako-silo ansible_host=192.168.20.10 ansible_user=commander
cassiel-silo ansible_host=192.168.20.20 ansible_user=commander
# ═══ GROUP AGGREGATION ═══
[hypervisors:children]
milkyway_hypervisors
andromeda_hypervisors
[drones:children]
milkyway_drones
andromeda_drones
[storage:children]
milkyway_storage
andromeda_storage
[gentoo:children]
hypervisors
drones
milkyway_workstations
[milkyway:children]
milkyway_hypervisors
milkyway_drones
milkyway_storage
milkyway_workstations
[andromeda:children]
andromeda_hypervisors
andromeda_drones
andromeda_storage
Group by location, by OS, by role. A single playbook can target gentoo (every Gentoo machine), drones (just the build swarm workers), andromeda_storage (just the remote NAS boxes), or any combination.
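The same patterns work for ad-hoc commands, including intersections. A couple of examples, with paths relative to the ansible/ directory of the repo layout shown later:

# Ping every Gentoo box; update only the remote drones
ansible gentoo -i inventory/hosts.ini -m ping
ansible-playbook -i inventory/hosts.ini playbooks/update-gentoo.yml --limit 'drones:&andromeda'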
System Updates
The playbook that replaced my two-hour SSH marathon:
# playbooks/update-gentoo.yml
---
- hosts: gentoo
  become: yes
  serial: 3  # Don't update everything at once
  tasks:
    - name: Gather package facts
      package_facts:
        manager: auto  # populates ansible_facts.packages for the snapper check

    - name: Sync portage tree
      command: emerge --sync
      changed_when: true

    - name: Create pre-update snapshot
      # --print-number makes snapper emit the snapshot ID on stdout,
      # which the post-snapshot task needs for --pre-number
      command: snapper --config root create --type pre --print-number --description "ansible-update"
      register: pre_snapshot
      when: "'snapper' in ansible_facts.packages"

    - name: Update world set
      command: emerge -uDN @world --usepkg --binpkg-respect-use=y
      register: emerge_result
      changed_when: "'Total' in emerge_result.stdout"

    - name: Create post-update snapshot
      command: >
        snapper --config root create --type post
        --pre-number {{ pre_snapshot.stdout | trim }}
        --description "ansible-update"
      when: pre_snapshot is not skipped

    - name: Rebuild kernel modules if needed
      command: emerge @module-rebuild
      when: "'sys-kernel' in emerge_result.stdout"

    - name: Reboot if kernel was updated
      reboot:
        msg: "Kernel update - rebooting"
        reboot_timeout: 300
      when: "'sys-kernel' in emerge_result.stdout"
serial: 3 is important. Without it, Ansible would try to update every machine simultaneously. That means every drone goes offline at once, plus the orchestrators, plus the gateway. The build swarm dies, any running builds are lost, and I get to explain to nobody why the entire infrastructure is down. Three at a time. The swarm keeps running on the remaining nodes.
The Snapper integration is my favorite part. Every Ansible-driven update creates a pre/post snapshot pair. If an update breaks something, I can see exactly what changed and roll back that specific machine. This is the safety net that makes automated updates across eight machines feel sane instead of terrifying.
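Recovery is two commands on the broken machine. Assuming the pre/post pair landed as snapshots 42 and 43 (numbers are illustrative):

# See what the update changed, then revert it
snapper --config root status 42..43
snapper --config root undochange 42..43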
Monitoring Deployment
The Promtail incident — the one that started all of this — is now a one-liner:
# playbooks/deploy-monitoring.yml
---
- hosts: gentoo
  become: yes
  roles:
    - role: node_exporter
      vars:
        node_exporter_version: "1.7.0"
        node_exporter_listen: "0.0.0.0:9100"
    - role: promtail
      vars:
        loki_url: "http://10.42.0.100:3100/loki/api/v1/push"
        promtail_scrape_configs:
          - job_name: syslog
            syslog:
              listen_address: "0.0.0.0:1514"
          - job_name: journal
            journal:
              path: /var/log/journal
  handlers:
    - name: restart promtail
      service:
        name: promtail
        state: restarted
Change the loki_url once in the playbook. Run it. Every machine gets the correct config. The Promtail incident can’t happen again because there’s one source of truth for the Loki endpoint, not eight independently maintained config files that I have to remember to update.
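This works because the role renders a single template everywhere. I haven’t shown the role internals, but the relevant part of a Promtail config template would look roughly like this (the template path and everything besides loki_url are assumptions):

# roles/promtail/templates/promtail.yml.j2 (sketch)
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: {{ loki_url }}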
Build Swarm Drone Setup
Bringing a new drone online used to be a multi-hour process. Install Gentoo, configure portage, set up the build user, install dependencies, configure SSH keys, register with the orchestrator. Now:
# playbooks/setup-drone.yml
---
- hosts: new_drone
  become: yes
  vars:
    orchestrator_host: "10.42.0.201"
    binhost_url: "http://10.42.0.194:8080/binpkgs"
  tasks:
    - name: Configure binary package host
      lineinfile:
        path: /etc/portage/make.conf
        regexp: '^PORTAGE_BINHOST='
        line: 'PORTAGE_BINHOST="{{ binhost_url }}"'
    - name: Set FEATURES for drone operation
      lineinfile:
        path: /etc/portage/make.conf
        regexp: '^FEATURES='
        line: 'FEATURES="buildpkg getbinpkg parallel-fetch candy"'
    - name: Create build user
      user:
        name: drone-build
        shell: /bin/bash
        groups: portage
        create_home: yes
    - name: Deploy SSH authorized keys
      authorized_key:
        user: drone-build
        key: "{{ lookup('file', 'files/drone-build.pub') }}"
    - name: Install drone dependencies
      portage:
        package:
          - dev-python/paramiko
          - app-misc/screen
          - sys-process/htop
        usepkg: yes
    - name: Deploy drone agent config
      template:
        src: templates/drone-agent.conf.j2
        dest: /etc/drone-agent.conf
        owner: drone-build
        mode: '0640'
    - name: Register with orchestrator
      uri:
        url: "http://{{ orchestrator_host }}:8080/api/drones/register"
        method: POST
        body_format: json
        body:
          hostname: "{{ inventory_hostname }}"
          cores: "{{ ansible_processor_vcpus }}"
          address: "{{ ansible_default_ipv4.address }}"
      register: registration
Spin up a VM, add it to the inventory, run the playbook. Twenty minutes later it’s accepting build jobs.
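Concretely, assuming the new VM answers to drone-vega (made-up name and address):

# 1. Inventory entry for the unprovisioned machine (hypothetical host)
# [new_drone]
# drone-vega ansible_host=10.42.0.210 ansible_user=commander
# 2. Run the playbook against it
ansible-playbook -i inventory/hosts.ini playbooks/setup-drone.yml --limit drone-vega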
Terraform for External Resources
Ansible handles what’s inside the machines. Terraform handles what’s outside — the stuff that exists in someone else’s cloud and would be a pain to recreate from memory.
Cloudflare DNS
Every public-facing service routes through Cloudflare Tunnels. The DNS records for these are managed in Terraform, not clicked through a web UI:
# terraform/cloudflare/dns.tf
resource "cloudflare_record" "tunnel_cname" {
  zone_id = var.cloudflare_zone_id
  name    = "home"
  value   = var.tunnel_cname
  type    = "CNAME"
  proxied = true
}

resource "cloudflare_record" "grafana" {
  zone_id = var.cloudflare_zone_id
  name    = "grafana"
  value   = var.tunnel_cname
  type    = "CNAME"
  proxied = true
}

resource "cloudflare_record" "gitea" {
  zone_id = var.cloudflare_zone_id
  name    = "git"
  value   = var.tunnel_cname
  type    = "CNAME"
  proxied = true
}
I add a service, I add a CNAME, I run terraform apply. If I need to see what DNS records exist, it’s in the code. Not buried in a web dashboard somewhere.
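Three resources that differ only by name are begging for a for_each. A sketch of the refactor, not what’s in the repo today:

# terraform/cloudflare/dns.tf (alternative sketch)
variable "tunnel_hostnames" {
  type    = set(string)
  default = ["home", "grafana", "git"]
}

resource "cloudflare_record" "tunnel" {
  for_each = var.tunnel_hostnames
  zone_id  = var.cloudflare_zone_id
  name     = each.value
  value    = var.tunnel_cname
  type     = "CNAME"
  proxied  = true
}

Adding a service then really is a one-word change to the default list.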
Remote State
Terraform state is stored in S3, encrypted. Not on my local disk. If my house burns down, the infrastructure configuration survives.
terraform {
  backend "s3" {
    bucket  = "homelab-terraform-state"
    key     = "infrastructure/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
  }
}
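The one thing missing there is state locking. If terraform apply ever runs from two machines at once, a DynamoDB lock table stops them from trampling the state. It is one extra line in the backend block (the table name is made up, and the table has to exist first):

# Inside the backend "s3" block
dynamodb_table = "homelab-terraform-locks"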
The Monorepo
Everything lives in one Git repository:
infrastructure/
├── ansible/
│   ├── inventory/
│   │   ├── hosts.ini
│   │   ├── group_vars/
│   │   │   ├── gentoo.yml      # Portage settings, USE flags
│   │   │   ├── drones.yml      # Build swarm config
│   │   │   ├── milkyway.yml    # Local network vars
│   │   │   └── andromeda.yml   # Remote network vars
│   │   └── host_vars/
│   │       ├── izar-orchestrator.yml
│   │       └── tarn-host.yml
│   ├── playbooks/
│   │   ├── update-gentoo.yml
│   │   ├── deploy-monitoring.yml
│   │   ├── setup-drone.yml
│   │   ├── backup-configs.yml
│   │   └── security-hardening.yml
│   └── roles/
│       ├── node_exporter/
│       ├── promtail/
│       ├── base_gentoo/
│       └── drone_agent/
├── terraform/
│   ├── cloudflare/
│   ├── backups/
│   └── modules/
└── kubernetes/
    └── flux-manifests/
This is the source of truth. If it’s not in Git, it doesn’t exist. Every change is a commit. Every commit has a diff. Six months from now, when something breaks and I can’t remember what I changed, git log --oneline ansible/playbooks/ tells me exactly what happened and when.
What I’d Change
I’ve been running this setup since August. Some of it is great. Some of it has rough edges.
The inventory file is getting unwieldy. A flat INI file was fine for five machines. With eleven hosts, two sites, group_vars, and host_vars, it’s starting to creak. I should probably switch to a dynamic inventory, or at least break the INI into per-site files, something like the sketch below. I haven’t done it yet because the current setup works and I’m afraid of breaking the playbook targeting.
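The split would probably look like this. Ansible treats an inventory directory as the union of every file inside it, so the invocation just changes from -i inventory/hosts.ini to -i inventory/ and the group names stay put (file names are my guesses):

# inventory/ as a directory instead of a single file
inventory/
├── milkyway.ini      # the [milkyway_*] groups
├── andromeda.ini     # the [andromeda_*] groups
├── cross-site.ini    # [gentoo:children], [drones:children], etc.
└── group_vars/       # unchanged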
Ansible is slow over Tailscale. Running playbooks against the Andromeda site adds noticeable latency. Each task opens a new SSH connection (or reuses one, depending on pipelining config), and every round-trip through Tailscale adds ~38ms. A playbook with 20 tasks takes almost a full second just in connection overhead. Not terrible, but annoying when you’re watching it run.
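Most of that overhead is avoidable in ansible.cfg: pipelining plus SSH connection multiplexing keeps one master connection open per host instead of paying the handshake on every task. A sketch of the settings I’d start with, not tested gospel:

# ansible.cfg
[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s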
I don’t test playbooks before running them on production. I know. I know. There’s no staging environment. Tau-Beta exists as a test VM, but I rarely bother to run playbooks against it first. Someday a bad playbook is going to break all eight Gentoo machines simultaneously and I’ll deserve it.
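The cheap middle ground I keep not using is check mode against the test VM before the real run. It can’t simulate command tasks (those are skipped in check mode), but it catches targeting and template mistakes for the cost of one extra command:

# Dry run against the test VM, then the real thing
ansible-playbook playbooks/update-gentoo.yml --check --diff --limit drone-tau-beta
ansible-playbook playbooks/update-gentoo.yml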
Terraform drift is real. I’ve manually changed Cloudflare settings through the web UI “just this once” at least four times. Every time, the next terraform plan shows a diff and I have to reconcile it. The discipline of never touching the web UI is harder than it sounds, especially at 2 AM when something is broken and you just want to flip a toggle.
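Drift is at least detectable. terraform plan with -detailed-exitcode returns 2 when the code and reality disagree, so a cron job can catch my 2 AM toggle-flipping. A sketch:

# Exit code 2 means the plan succeeded and found pending changes
terraform -chdir=terraform/cloudflare plan -detailed-exitcode -no-color > /dev/null 2>&1
[ "$?" -eq 2 ] && echo "Cloudflare drift detected - web UI and Terraform disagree"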
The monorepo is the right call. No regrets there. One repo, one source of truth, one git log. I tried separate repos for Ansible and Terraform once and spent more time managing the repos than managing the infrastructure.
The point of all this isn’t perfection. It’s that when something breaks at midnight — and it will — I can fix it once and know it’s fixed everywhere. That Promtail incident? Eight servers, two hours, missed one. Now it’s one playbook, three minutes, misses nothing.