Most homelab content stops at “I installed Proxmox on a NUC.” This post goes further — how I built a multi-node cluster running a full Kubernetes platform with GPU inference, GitOps, automated backups, and dozens of services. No cloud. No monthly bill. Real production patterns on consumer hardware.

This isn’t a step-by-step tutorial. It’s a field report — what worked, what broke, and what I’d do differently if I started over.

Why Build This

I wanted to understand infrastructure at a level that cloud platforms deliberately abstract away. Running your own cluster forces you to deal with networking, storage, scheduling, and failure modes firsthand. There’s no “open a support ticket” when your node freezes at 3 AM.

Beyond the learning, the homelab actually runs things I use daily — AI agent development, media management, home automation, workflow orchestration. Every service runs on hardware I own, on a network I control.

The Hardware

All AMD Ryzen, mostly mini PCs. The whole cluster fits on a shelf and draws less power than a single rack server. Mix of Ryzen 5, 7, and 9 chips across the nodes, with RAM ranging from 32 GB to 128 GB per node depending on workload.

One node is a custom ITX build with a Tesla P40 (24 GB VRAM) for LLM inference. Everything else is off-the-shelf mini PCs in the $200-400 range.

Hardware Lessons

Buy the same hardware when possible. I didn’t, and I regret it. When you have a bunch of different hardware platforms, you get a bunch of different BIOS menus, thermal profiles, and failure modes. Two of my nodes are identical — and those two are by far the easiest to manage.

Check your RAM speed. One of my nodes ran DDR4 at 2133 MT/s for months because DOCP (AMD’s XMP equivalent) was disabled in BIOS. Enabling it brought it up to 3200 MT/s — basically free performance sitting on the table. Always verify with dmidecode -t memory after building a new node.
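
To make that check routine, here is a small sketch that flags underclocked DIMMs from `dmidecode -t memory` output; the field positions assume dmidecode's standard layout, so treat the parsing as an assumption to verify against your own output:

```shell
#!/bin/sh
# Flag DIMMs running below their rated speed, given `dmidecode -t memory`
# output on stdin. "Speed" is the module's rated clock; "Configured Memory
# Speed" is what the BIOS actually set.
check_ram_speed() {
  awk '
    /^\tSpeed: [0-9]+ MT\/s/                   { rated = $2 }
    /^\tConfigured Memory Speed: [0-9]+ MT\/s/ {
      if ($4 + 0 < rated + 0)
        printf "DIMM underclocked: %s MT/s configured, %s MT/s rated\n", $4, rated
    }'
}

# Usage (as root): dmidecode -t memory | check_ram_speed
```

Silence means every module is running at its rated clock; any output means a trip into BIOS to enable DOCP.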

iGPU steals RAM on AMD APUs. Newer AMD chips default to reserving 6-8 GB for the integrated GPU. On headless servers, that’s pure waste. Setting UMA Frame Buffer to minimum in BIOS reclaimed over 15 GB across my cluster. And no, blacklisting amdgpu at the kernel level doesn’t help — the reservation happens at the BIOS level before the OS even boots.

Proxmox Cluster

All nodes run Proxmox VE 8.x in a corosync cluster. The quorum math means I can lose multiple nodes simultaneously without losing cluster coordination, which has come in handy more than once.
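
The quorum arithmetic itself is simple; the node counts below are illustrative, not my exact topology:

```shell
#!/bin/sh
# Corosync keeps the cluster quorate while a strict majority of votes remain
# (one vote per node, no qdevice assumed): quorum = floor(n/2) + 1.
quorum()    { echo $(( $1 / 2 + 1 )); }
tolerated() { echo $(( $1 - ($1 / 2 + 1) )); }

# A 7-node cluster needs 4 votes and survives 3 simultaneous node losses;
# a 4-node cluster needs 3 votes and survives only 1.
```

With an even node count you get a node's worth less failure tolerance, which is why odd-sized clusters are the usual recommendation.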

The AMD C-State Freeze Bug

This was the single most frustrating issue in the entire build. A couple of my Ryzen-based nodes would randomly hard-freeze — no SSH, no console, no kernel panic. Just dead. The only clue was NVMe SMART data showing the “unsafe shutdowns” counter climbing.

Root cause: buggy C2/C3 power states on certain AMD Zen 2/3 mobile CPUs. The fix is limiting the kernel to C1:

# In /etc/default/grub; run update-grub and reboot after editing
GRUB_CMDLINE_LINUX_DEFAULT="quiet processor.max_cstate=1"

I also disabled deeper C-states in BIOS on every node and deployed a hardware watchdog on the most crash-prone node. The watchdog auto-reboots if the kernel stops responding — not elegant, but it means I don’t wake up to a dead node anymore.
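
The systemd side of the watchdog is a small drop-in. A sketch, with one assumption called out: the kernel watchdog driver varies by board (sp5100_tco is common on AMD chipsets, so verify yours in dmesg):

```ini
# /etc/systemd/system.conf.d/watchdog.conf
# Requires a loaded watchdog driver (e.g. sp5100_tco on many AMD boards).
# systemd pings /dev/watchdog; if PID 1 stops responding for 30 seconds,
# the hardware forces a reboot.
[Manager]
RuntimeWatchdogSec=30s
```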

Takeaway: Consumer hardware in always-on environments will surface firmware bugs that desktop users never see. Keep an eye on your NVMe SMART data — the “unsafe shutdowns” counter is your canary for silent crashes.

Kubernetes on Proxmox

The K8s cluster runs on dedicated VMs — a set of control plane nodes (tainted NoSchedule) and several workers spread across the physical hosts.

Why VMs, Not Bare Metal

Every K8s node is a Proxmox VM. This gives me:

  • Live migration — move a worker to another host during maintenance without draining pods
  • Snapshots — snapshot a node before a risky kernel upgrade
  • Resource isolation — K8s doesn’t own the entire machine; LXCs for DNS, backups, and other services run alongside
  • Templating — new workers are cloned from a template with kubeadm and containerd pre-installed

The overhead is minimal compared to the operational flexibility.

Storage: Longhorn

Longhorn provides distributed block storage across the cluster. I run a few storage classes for different access patterns:

  • Replicated RWO — the default, data survives a node failure
  • Single-replica RWO — for large volumes where re-downloading is cheaper than replicating (LLM models, for instance)
  • ReadWriteMany — NFS-backed, for shared data like media libraries
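
The single-replica class, for example, is just a parameter on the Longhorn provisioner. A sketch with an illustrative name:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-single-replica   # illustrative name
provisioner: driver.longhorn.io
allowVolumeExpansion: true
parameters:
  numberOfReplicas: "1"           # one copy: cheap, but not node-failure-safe
  staleReplicaTimeout: "30"
  dataLocality: "best-effort"     # try to keep the replica on the pod's node
```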

Longhorn handles automated snapshots and backups, which has saved me more than once during storage failures.

Gotcha: snapshot chain limits. Longhorn’s engine has a hardcoded chain limit of around 20 snapshots. If your recurring snapshot jobs retain more than that, replica rebuilds stall silently. I learned this the hard way when a node failure triggered a rebuild that just… hung. Keep your retention count well under that limit, and set up cleanup jobs.

Ingress and TLS

Traefik handles ingress with MetalLB providing the LoadBalancer IP. cert-manager automates TLS via Cloudflare DNS-01 challenges — all certificates auto-renew, no manual intervention.

A wildcard DNS record means any new service just needs an IngressRoute and it’s immediately accessible with valid TLS. Spinning up a new service and having it reachable with HTTPS in under a minute never gets old.
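
That flow is one small manifest per service. A hedged sketch (hostname, service name, and TLS secret are placeholders, and the IngressRoute apiVersion depends on your Traefik version):

```yaml
apiVersion: traefik.io/v1alpha1   # traefik.containo.us/v1alpha1 on older Traefik
kind: IngressRoute
metadata:
  name: whoami                    # placeholder service
  namespace: default
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`whoami.example.internal`)   # covered by the wildcard record
      kind: Rule
      services:
        - name: whoami
          port: 80
  tls:
    secretName: wildcard-example-internal-tls  # cert-manager-issued wildcard
```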

Gotcha: wildcard DNS and Kubernetes DNS resolution don’t play nice. Kubernetes pods default to ndots:5 in their resolv.conf, which means short hostnames get the cluster search domains appended before trying the bare name. If you have a wildcard DNS entry for your internal domain, pods trying to reach external hosts can accidentally resolve something.external.com.your-internal-domain.com — which matches your wildcard and returns your Traefik IP. The TLS handshake then fails because the cert doesn’t match.

Fix: override ndots in the pod’s dnsConfig for any workload that needs to reach external registries or APIs.

dnsConfig:
  options:
    - name: ndots
      value: "2"

GPU Passthrough

One node has an NVIDIA Tesla P40 passed through to a K8s worker VM for LLM inference. Getting this working required:

  1. IOMMU enabled in both BIOS and kernel cmdline
  2. q35 machine type on the VM (the default i440fx lacks the PCIe support that GPU passthrough needs)
  3. NVIDIA drivers installed inside the VM
  4. NVIDIA Container Toolkit with the nvidia runtime set as the default containerd runtime — not just available, but default. The NVIDIA device plugin daemonset doesn’t specify a runtimeClass, so it uses whatever the default is. If that’s not nvidia, it can’t find the GPU libraries.
  5. Node taint to prevent non-GPU workloads from scheduling on the expensive node
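
Step 4 lives in containerd's config. A fragment of what that looks like as I understand the NVIDIA Container Toolkit setup (file path and binary location may differ per distro):

```toml
# /etc/containerd/config.toml (fragment) — the key line is default_runtime_name
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```

Restart containerd after editing, then confirm the device plugin pod actually reports the GPU in the node's allocatable resources.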

This runs Ollama serving quantized models for AI agent inference across the homelab.

GPU Passthrough Gotchas

These are the kinds of things no tutorial mentions.

PCI address instability. If your GPU sits behind a PCIe switch (common on consumer boards), the bus address can shift on every reboot. One day it’s at one address, next reboot it’s somewhere else. You’ll need to check and potentially update the VM’s passthrough config after any host reboot. Annoying, but manageable.

NIC name instability. Same root cause — if a network card is on the same PCIe switch, it gets the same bus renumbering problem. The fix is a systemd .link file that pins the interface name by MAC address instead of relying on PCI bus topology.
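
The pinning itself is a few lines (filename, MAC, and interface name below are placeholders):

```ini
# /etc/systemd/network/10-lan0.link
# Match on the MAC, which survives PCI renumbering, instead of bus topology.
[Match]
MACAddress=aa:bb:cc:dd:ee:ff

[Link]
Name=lan0
```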

D3cold bricking. If a GPU enters the D3cold power state and you try to PCI-remove it, it won’t come back on rescan. Full host reboot required. Disable D3cold before doing any passthrough operations.

Deployment strategy. With only one GPU, your LLM deployment must use strategy: Recreate, not RollingUpdate. Rolling updates create a surge pod that requests GPU resources — but there’s no second GPU to schedule it on. Deadlock. Your old pod gets killed, the new pod can’t start, and you’re down until you intervene.
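
A minimal sketch of what that looks like in the Deployment (image and names are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama                 # illustrative
spec:
  replicas: 1
  strategy:
    type: Recreate             # kill the old pod before starting the new one,
                               # so the single GPU is free when it schedules
  selector:
    matchLabels: { app: ollama }
  template:
    metadata:
      labels: { app: ollama }
    spec:
      containers:
        - name: ollama
          image: ollama/ollama   # illustrative
          resources:
            limits:
              nvidia.com/gpu: 1
```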

GitOps with ArgoCD

ArgoCD manages all workloads using an app-of-apps pattern. Manifest changes flow through Git — push a change, ArgoCD detects drift, syncs automatically. I haven’t kubectl apply’d anything by hand in months and it’s honestly hard to imagine going back.
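
The root of the app-of-apps pattern is itself just an Application pointing at a directory of child Application manifests. A sketch with placeholder repo details:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root                    # the app-of-apps entry point
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/homelab.git   # placeholder
    targetRevision: main
    path: apps                  # directory of child Application manifests
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true               # delete resources removed from Git
      selfHeal: true            # revert manual drift automatically
```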

CI/CD Pipeline

GitHub Actions workflows run on a self-hosted runner inside the cluster:

  • IaC pipeline — OpenTofu plan on PR, apply on merge (manages Proxmox resources)
  • Container builds — detects changed Dockerfiles, builds images, pushes to a registry
  • Manifest validation — kubeconform checks K8s manifests on every PR
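
The manifest-validation job is the smallest of the three. A sketch of the workflow, with the manifest path and runner label as placeholders:

```yaml
# .github/workflows/validate.yml (sketch)
name: validate-manifests
on: [pull_request]
jobs:
  kubeconform:
    runs-on: self-hosted        # placeholder runner label
    steps:
      - uses: actions/checkout@v4
      - name: Validate Kubernetes manifests
        run: kubeconform -strict -summary -ignore-missing-schemas k8s/
```

`-ignore-missing-schemas` keeps CRDs (IngressRoute, Application, and friends) from failing validation unless you also point kubeconform at their schemas.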

One fun lesson: the runner uses Docker inside an unprivileged LXC. Newer versions of runc (1.3.x+) break in that environment due to a sysctl permission issue. I had to pin to an older Docker version. This is the kind of thing that costs you a day if you let packages auto-update without testing.

IaC Lesson: Guard Your Stateful Resources

My IaC pipeline auto-applies on merge to main. That’s fine for Kubernetes manifests — they’re declarative and non-destructive. It’s not fine for infrastructure resources where “update” sometimes means “destroy and recreate.”

I learned this when a simple IP address change in a container definition triggered a destroy-and-recreate cycle. The container was destroyed successfully, but recreation failed due to a permission issue. Result: container gone, data gone, no way to get it back.

The fix is straightforward:

lifecycle {
  prevent_destroy = true
  ignore_changes  = [initialization, network_interface]
}

Treat stateful resources as pets, not cattle — even in your IaC. prevent_destroy is cheap insurance.

What’s Running

The cluster hosts dozens of services across many namespaces. The broad categories:

  • AI/LLM — inference server, vector database, RAG interfaces, AI workflow platforms
  • Agents — multi-agent system for SRE, development, and personal automation
  • Monitoring — metrics collection, LLM observability, dashboards
  • Media — the usual *arr stack, media server, document management
  • Infrastructure — ingress, cert management, distributed storage, GitOps, policy engine, backups, secrets management
  • Other — workflow automation, home automation, design tools, time-series databases

Everything has TLS, persistent storage, and automated backups. Secrets are encrypted in Git via SealedSecrets and only decrypted inside the cluster.

Cost

The whole thing runs on roughly $30/month in electricity. No cloud bills, no per-seat licensing, no egress charges. Hardware was acquired over 18 months, mostly mini PCs in the $200-400 range. The GPU node was the most expensive piece.

For what I’d pay for a few months of equivalent cloud resources, I own the hardware outright and have no recurring costs beyond power and internet.

What I’d Do Differently

Start with identical nodes. Hardware diversity is operational complexity in disguise. Every BIOS update, every firmware quirk, every thermal issue is unique to that specific platform. Three identical nodes beats seven different ones.

Budget for faster networking from the start. 1 Gbps works fine until you have distributed storage doing replica rebuilds across the same link as your workloads and cluster heartbeats. Network bandwidth is the bottleneck I hit most often. 2.5 GbE or 10 GbE would’ve saved me headaches.

Pin every version. Container runtimes, GPU drivers, system packages — if you let any of these auto-update, you will eventually wake up to a broken cluster. Pin versions, test upgrades on a single node, then roll out.

Keep IaC and stateful resources on separate tracks. Auto-apply is great for declarative K8s manifests. It’s dangerous for infrastructure where “update” can mean “destroy.” Separate the blast radius.

Wrapping Up

A homelab at this scale isn’t a weekend project — it’s an ongoing thing that teaches you something new every week. The AMD C-state bug taught me about hardware watchdogs. The wildcard DNS issue taught me about Kubernetes DNS resolution mechanics. The GPU passthrough taught me more about PCIe topology than I ever wanted to know.

Every issue in this post came from running real workloads on real hardware, 24/7, for months. That’s the value proposition of a homelab: it breaks in ways that tutorials never cover, and fixing those breaks is where the actual learning happens.

If you’re thinking about building something similar — start small, document everything, and expect to be surprised. The infrastructure will teach you whether you’re ready or not.