Infrastructure as Code: Ansible and Terraform for the Homelab

It’s a Tuesday night. Something is wrong with Promtail on the Andromeda site — logs stopped flowing to Loki about six hours ago and I didn’t notice until I went looking for a specific error message that should have been there. So I SSH into Tarn-Host. Check the Promtail config. It’s fine. Restart the service. Still broken. Check the systemd journal. The Loki endpoint URL is wrong — I updated the Loki server address last week on the local network and forgot to update the remote site.

So I fix it. Update the config on Tarn-Host. Then I remember: Drone-Meridian also runs Promtail. SSH into that. Same fix. Then Meridian-Mako-Silo runs a Promtail container. Different config format. Fix that too. Then I think: did I update the Milky Way nodes? SSH into Izar-Orchestrator. Check. It’s fine there, I already fixed it. But Drone-Tau-Beta has the old config. Fix. Drone-Izar? Also old. Fix. Altair-Link? Old. Fix.

Eight servers. One config change. Two hours of SSHing around, checking each one, fixing each one, hoping I didn’t miss any. I missed one. Found it the next day.

That was the last time I did it manually.

Ansible for Configuration

One server acts as the Ansible control node: Canopus-Outpost, my desktop workstation. From here, I can target every machine in both galaxies.

The inventory splits across two sites. Milky Way is the local network at 10.42.0.x. Andromeda is Dad’s house, 40 miles away, at 192.168.20.x, connected through Tailscale. Different subnets, different hardware, different levels of “someone might have changed something without telling me.”

The Full Inventory

# inventory/hosts.ini

# ═══ MILKY WAY (Local - 10.42.0.x) ═══

[milkyway_hypervisors]
izar-orchestrator  ansible_host=10.42.0.201  ansible_user=commander
altair-link        ansible_host=10.42.0.199  ansible_user=commander

[milkyway_drones]
drone-izar         ansible_host=10.42.0.203  ansible_user=commander  cores=16
drone-tau-beta     ansible_host=10.42.0.194  ansible_user=commander  cores=8

[milkyway_storage]
rigel-silo         ansible_host=10.42.0.10   ansible_user=commander

[milkyway_workstations]
canopus-outpost    ansible_host=10.42.0.100  ansible_user=commander

# ═══ ANDROMEDA (Remote - 192.168.20.x) ═══

[andromeda_hypervisors]
tarn-host          ansible_host=192.168.20.100  ansible_user=commander

[andromeda_drones]
drone-tarn         ansible_host=192.168.20.196  ansible_user=commander  cores=14
drone-meridian     ansible_host=192.168.20.50   ansible_user=commander  cores=24

[andromeda_storage]
meridian-mako-silo ansible_host=192.168.20.10   ansible_user=commander
cassiel-silo       ansible_host=192.168.20.20   ansible_user=commander

# ═══ GROUP AGGREGATION ═══

[hypervisors:children]
milkyway_hypervisors
andromeda_hypervisors

[drones:children]
milkyway_drones
andromeda_drones

[storage:children]
milkyway_storage
andromeda_storage

[gentoo:children]
hypervisors
drones
milkyway_workstations

[all_local:children]
milkyway_hypervisors
milkyway_drones
milkyway_storage
milkyway_workstations

[all_remote:children]
andromeda_hypervisors
andromeda_drones
andromeda_storage

Group by location, by OS, by role. A single playbook can target gentoo (every Gentoo machine), drones (just the build swarm workers), andromeda_storage (just the remote NAS boxes), or any combination.
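
A couple of examples of that targeting, run ad hoc from the control node (the module choices here are just illustrations, not commands I run on a schedule):

# Reachability check of every build drone, local and remote
ansible drones -i inventory/hosts.ini -m ping

# One-off command against only the remote NAS boxes
ansible andromeda_storage -i inventory/hosts.ini -m command -a "uptime"

# Run the update playbook, but only against the local Gentoo machines
# (--limit intersects with the playbook's own hosts: gentoo)
ansible-playbook -i inventory/hosts.ini playbooks/update-gentoo.yml --limit all_local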

System Updates

The playbook that replaced my two-hour SSH marathon:

# playbooks/update-gentoo.yml
---
- hosts: gentoo
  become: yes
  serial: 3  # Don't update everything at once
  tasks:
    - name: Sync portage tree
      command: emerge --sync
      changed_when: true

    - name: Gather installed-package facts (needed for the snapper check)
      package_facts:
        manager: auto

    - name: Create pre-update snapshot
      command: snapper --config root create --type pre --print-number --description "ansible-update"
      register: pre_snapshot
      when: "'snapper' in ansible_facts.packages"

    - name: Update world set
      command: emerge -uDN @world --usepkg --binpkg-respect-use=y
      register: emerge_result
      changed_when: "'Total' in emerge_result.stdout"

    - name: Create post-update snapshot
      command: >
        snapper --config root create --type post
        --pre-number {{ pre_snapshot.stdout | trim }}
        --description "ansible-update"
      when: pre_snapshot is not skipped

    - name: Rebuild kernel modules if needed
      command: emerge @module-rebuild
      when: "'sys-kernel' in emerge_result.stdout"

    - name: Reboot if kernel was updated
      reboot:
        msg: "Kernel update - rebooting"
        reboot_timeout: 300
      when: "'sys-kernel' in emerge_result.stdout"

serial: 3 is important. Without it, Ansible would try to update every machine simultaneously. That means every drone goes offline at once, plus the orchestrators, plus the gateway. The build swarm dies, any running builds are lost, and I get to explain to nobody why the entire infrastructure is down. Three at a time. The swarm keeps running on the remaining nodes.
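
If I ever want a different batch size for a one-off run, the play keyword can read an extra-var instead of a hard-coded number. A minimal sketch; batch_size is a made-up variable, not something in the playbook above:

# update-gentoo.yml play header with a tunable batch size (sketch)
- hosts: gentoo
  become: yes
  serial: "{{ batch_size | default(3) }}"   # ansible-playbook ... -e batch_size=1 for a cautious run
  max_fail_percentage: 0                    # a failed host in any batch stops the remaining batches
  tasks:
    # ... same tasks as above ...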

The Snapper integration is my favorite part. Every Ansible-driven update creates a pre/post snapshot pair. If an update breaks something, I can see exactly what changed and roll back that specific machine. This is the safety net that makes automated updates across eight machines feel sane instead of terrifying.
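
Rolling back looks roughly like this on the affected machine; the snapshot numbers are whatever that host's pre/post pair happens to be (41 and 42 here are made up):

# Find the pre/post pair from the bad update
snapper --config root list
# Revert everything that changed between them
snapper --config root undochange 41..42
# Or, on a btrfs root set up for it, roll the whole root back and reboot
snapper --config root rollback 41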

Monitoring Deployment

The fix for the Promtail incident, the one that started all of this, is now a one-line change and a single playbook run:

# playbooks/deploy-monitoring.yml
---
- hosts: gentoo
  become: yes
  roles:
    - role: node_exporter
      vars:
        node_exporter_version: "1.7.0"
        node_exporter_listen: "0.0.0.0:9100"

    - role: promtail
      vars:
        loki_url: "http://10.42.0.100:3100/loki/api/v1/push"
        promtail_scrape_configs:
          - job_name: syslog
            syslog:
              listen_address: "0.0.0.0:1514"
          - job_name: journal
            journal:
              path: /var/log/journal

  handlers:
    - name: restart promtail
      service:
        name: promtail
        state: restarted

Change the loki_url once in the playbook. Run it. Every machine gets the correct config. The Promtail incident can’t happen again because there’s one source of truth for the Loki endpoint, not eight independently maintained config files that I have to remember to update.
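
The deploy itself is one command from the control node (loki_url arguably belongs in group_vars/gentoo.yml rather than the playbook, which would make it visible to every play, but either way there is exactly one copy of it in Git):

ansible-playbook -i inventory/hosts.ini playbooks/deploy-monitoring.yml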

Build Swarm Drone Setup

Bringing a new drone online used to be a multi-hour process. Install Gentoo, configure portage, set up the build user, install dependencies, configure SSH keys, register with the orchestrator. Now:

# playbooks/setup-drone.yml
---
- hosts: new_drone
  become: yes
  vars:
    orchestrator_host: "10.42.0.201"
    binhost_url: "http://10.42.0.194:8080/binpkgs"

  tasks:
    - name: Configure binary package host
      lineinfile:
        path: /etc/portage/make.conf
        regexp: '^PORTAGE_BINHOST='
        line: 'PORTAGE_BINHOST="{{ binhost_url }}"'

    - name: Set FEATURES for drone operation
      lineinfile:
        path: /etc/portage/make.conf
        regexp: '^FEATURES='
        line: 'FEATURES="buildpkg getbinpkg parallel-fetch candy"'

    - name: Create build user
      user:
        name: drone-build
        shell: /bin/bash
        groups: portage
        create_home: yes

    - name: Deploy SSH authorized keys
      authorized_key:
        user: drone-build
        key: "{{ lookup('file', 'files/drone-build.pub') }}"

    - name: Install drone dependencies
      portage:
        package:
          - dev-python/paramiko
          - app-misc/screen
          - sys-process/htop
        usepkg: yes

    - name: Deploy drone agent config
      template:
        src: templates/drone-agent.conf.j2
        dest: /etc/drone-agent.conf
        owner: drone-build
        mode: '0640'

    - name: Register with orchestrator
      uri:
        url: "http://{{ orchestrator_host }}:8080/api/drones/register"
        method: POST
        body_format: json
        body:
          hostname: "{{ inventory_hostname }}"
          cores: "{{ ansible_processor_vcpus }}"
          address: "{{ ansible_default_ipv4.address }}"
      register: registration

Spin up a VM, add it to the inventory, run the playbook. Twenty minutes later it’s accepting build jobs.
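
The whole flow, with a hypothetical hostname for the new VM:

# inventory/hosts.ini -- add the target group the playbook expects
# (drone-vega and its address are made up)
#
#   [new_drone]
#   drone-vega  ansible_host=10.42.0.210  ansible_user=commander  cores=12

ansible-playbook -i inventory/hosts.ini playbooks/setup-drone.yml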

Terraform for External Resources

Ansible handles what’s inside the machines. Terraform handles what’s outside — the stuff that exists in someone else’s cloud and would be a pain to recreate from memory.

Cloudflare DNS

Every public-facing service routes through Cloudflare Tunnels. The DNS records for these are managed in Terraform, not clicked through a web UI:

# terraform/cloudflare/dns.tf

resource "cloudflare_record" "tunnel_cname" {
  zone_id = var.cloudflare_zone_id
  name    = "home"
  value   = var.tunnel_cname
  type    = "CNAME"
  proxied = true
}

resource "cloudflare_record" "grafana" {
  zone_id = var.cloudflare_zone_id
  name    = "grafana"
  value   = var.tunnel_cname
  type    = "CNAME"
  proxied = true
}

resource "cloudflare_record" "gitea" {
  zone_id = var.cloudflare_zone_id
  name    = "git"
  value   = var.tunnel_cname
  type    = "CNAME"
  proxied = true
}

I add a service, I add a CNAME, I run terraform apply. If I need to see what DNS records exist, it’s in the code. Not buried in a web dashboard somewhere.
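
Those three records are identical except for the name, so they are a natural candidate for a single for_each resource. A sketch using the same variables:

# terraform/cloudflare/dns.tf (alternative, sketch)
resource "cloudflare_record" "tunnel_services" {
  for_each = toset(["home", "grafana", "git"])

  zone_id = var.cloudflare_zone_id
  name    = each.key
  value   = var.tunnel_cname
  type    = "CNAME"
  proxied = true
}

Adding a service then means adding one string to the list, though moving the existing records under for_each would take a few terraform state mv commands first.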

Remote State

Terraform state is stored in S3, encrypted. Not on my local disk. If my house burns down, the infrastructure configuration survives.

terraform {
  backend "s3" {
    bucket  = "homelab-terraform-state"
    key     = "infrastructure/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
  }
}
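
Pointing an existing local-state setup at the bucket is a one-time migration; after that, every plan and apply reads and writes S3:

terraform init -migrate-state   # one-time: copies the local terraform.tfstate into the bucket
terraform plan
terraform apply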

The Monorepo

Everything lives in one Git repository:

infrastructure/
├── ansible/
│   ├── inventory/
│   │   ├── hosts.ini
│   │   ├── group_vars/
│   │   │   ├── gentoo.yml        # Portage settings, USE flags
│   │   │   ├── drones.yml        # Build swarm config
│   │   │   ├── milkyway.yml      # Local network vars
│   │   │   └── andromeda.yml     # Remote network vars
│   │   └── host_vars/
│   │       ├── izar-orchestrator.yml
│   │       └── tarn-host.yml
│   ├── playbooks/
│   │   ├── update-gentoo.yml
│   │   ├── deploy-monitoring.yml
│   │   ├── setup-drone.yml
│   │   ├── backup-configs.yml
│   │   └── security-hardening.yml
│   └── roles/
│       ├── node_exporter/
│       ├── promtail/
│       ├── base_gentoo/
│       └── drone_agent/
├── terraform/
│   ├── cloudflare/
│   ├── backups/
│   └── modules/
└── kubernetes/
    └── flux-manifests/

This is the source of truth. If it’s not in Git, it doesn’t exist. Every change is a commit. Every commit has a diff. Six months from now, when something breaks and I can’t remember what I changed, git log --oneline ansible/playbooks/ tells me exactly what happened and when.

What I’d Change

I’ve been running this setup since August. Some of it is great. Some of it has rough edges.

The inventory file is getting unwieldy. A flat INI file was fine for five machines. With eleven hosts, group_vars, host_vars, and two sites, it’s starting to creak. I should probably switch to a dynamic inventory or at least break the INI into per-site files. Haven’t done it yet because the current setup works and I’m afraid of breaking the playbook targeting.
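
A per-site split would look something like this; Ansible treats a directory passed to -i as one merged inventory, so the aggregation groups keep working (file names are illustrative):

# inventory/
# ├── andromeda.ini      # the [andromeda_*] groups
# ├── milkyway.ini       # the [milkyway_*] groups
# ├── zz_groups.ini      # the :children aggregation groups (named so it parses after the site files)
# └── group_vars/        # unchanged
#
# Playbooks then point at the directory instead of a single file:
ansible-playbook -i inventory/ playbooks/update-gentoo.yml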

Ansible is slow over Tailscale. Running playbooks against the Andromeda site adds noticeable latency. Each task costs at least one SSH round-trip, and several more when pipelining is off, because Ansible copies the module file over before executing it. Every round-trip through Tailscale adds ~38ms, so a 20-task playbook spends the better part of a second on latency alone, and often a few seconds in practice. Not terrible, but annoying when you’re watching it run.
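
Most of that overhead is tunable. Pipelining and persistent SSH connections cut the per-task round-trips considerably; a sketch of the relevant ansible.cfg settings, values illustrative:

# ansible.cfg
[defaults]
forks = 10                  # work on more hosts in parallel

[ssh_connection]
pipelining = True           # run modules over the open SSH session instead of copying them first
ssh_args = -o ControlMaster=auto -o ControlPersist=120s   # reuse one connection per host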

I don’t test playbooks before running them on production. I know. I know. There’s no staging environment. Tau-Beta exists as a test VM, but I rarely bother to run playbooks against it first. Someday a bad playbook is going to break all eight Gentoo machines simultaneously and I’ll deserve it.
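
The low-effort middle ground would be a check-mode run against the test VM before touching anything else. It won’t catch everything (command tasks are skipped in check mode), but it does catch template and variable mistakes:

ansible-playbook -i inventory/hosts.ini playbooks/deploy-monitoring.yml \
  --limit drone-tau-beta --check --diff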

Terraform drift is real. I’ve manually changed Cloudflare settings through the web UI “just this once” at least four times. Every time, the next terraform plan shows a diff and I have to reconcile it. The discipline of never touching the web UI is harder than it sounds, especially at 2 AM when something is broken and you just want to flip a toggle.
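
Reconciling the drift goes one of two ways, depending on whether the 2 AM change deserves to stay:

# Keep the manual change: fold real-world state back into Terraform's state,
# then edit the .tf files to match so the next plan is clean
terraform apply -refresh-only

# Or throw it away: review the diff and let apply push the code's version back out
terraform plan
terraform apply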

The monorepo is the right call. No regrets there. One repo, one source of truth, one git log. I tried separate repos for Ansible and Terraform once and spent more time managing the repos than managing the infrastructure.

The point of all this isn’t perfection. It’s that when something breaks at midnight — and it will — I can fix it once and know it’s fixed everywhere. That Promtail incident? Eight servers, two hours, missed one. Now it’s one playbook, three minutes, misses nothing.