I woke up to a cluster that had effectively turned itself into a read-only museum. My VMs were running, but I couldn’t start a new one, I couldn’t migrate a workload, and the Proxmox GUI was throwing “Cluster not ready - no quorum” errors across the board. I had a two-node setup, one node had rebooted for a kernel update, and the remaining node decided that since it didn’t have a majority, it no longer had the right to make decisions.

If you’re building a Proxmox cluster, quorum is the one concept that will either be completely invisible or the primary reason your entire infrastructure freezes. Most people treat it as a checkbox during the cluster creation wizard, but in a home lab, the math of quorum often clashes with the reality of how many physical servers you can actually fit in your rack.

What I tried first

My initial instinct was that “Cluster” simply meant “nodes that can talk to each other.” I assumed that as long as one node was alive, the cluster was alive. I set up two beefy nodes, linked them together, and felt confident.

Then I hit the quorum wall. A cluster needs a strict majority of votes to act: floor(n/2) + 1. For two nodes, that means both votes. If one node goes down, the survivor holds one vote out of two, which is exactly half and not a majority. The remaining node loses quorum and enters a protective state. It stops allowing configuration changes to prevent “split-brain,” a scenario where both nodes think they are the master and start writing conflicting data to shared storage, which is a great way to corrupt your VM disks.
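To make the arithmetic concrete, here is a minimal shell sketch. The function name quorum_threshold is my own invention, not a Proxmox or Corosync command; it only reproduces the majority formula, assuming every node carries one vote.

```shell
# Hypothetical helper, not part of pvecm: votes needed for a majority.
# Shell arithmetic uses integer division, so this is floor(n / 2) + 1.
quorum_threshold() {
    echo $(( $1 / 2 + 1 ))
}

# Print the threshold and how many node failures each cluster size survives.
for n in 2 3 4 5; do
    need=$(quorum_threshold "$n")
    echo "nodes=$n quorum=$need tolerated_failures=$(( n - need ))"
done
```

The two-node row is the one that bit me: the threshold equals the node count, so the number of failures it can tolerate is zero.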

I tried to “fix” this by manually forcing quorum on the surviving node using pvecm expected 1. It worked for a few minutes, but it’s a manual band-aid. Every time a node rebooted or a network cable acted up, I was back in the CLI fighting with the cluster manager. I realized I was fighting the fundamental design of Corosync, and the only way out was to change the voting math.

I also tried ignoring it and just relying on the “High Availability” (HA) settings in the GUI. I thought that by marking a VM as “HA,” Proxmox would just move it to the other node if one failed. I was wrong. HA depends on quorum. If the cluster loses quorum, the HA manager stops functioning because it can’t be sure if the other node is actually dead or if the network is just partitioned. Without a majority vote, the HA manager refuses to fence the other node or start the VM elsewhere, because doing so without quorum is exactly how you end up with two instances of the same VM writing to the same disk.

The actual solution

You have three real options depending on your hardware budget and your tolerance for manual intervention. I’m basing these configurations on PVE 8.x, where the Corosync implementation is stable but still strictly adheres to these voting rules.

Option 1: The Three-Node Standard

The cleanest way to solve quorum is to just add a third node. With three nodes, quorum is two votes. If one node dies, two remain. You still have a majority, and HA actually works as intended. This is the “gold standard” for a reason. You don’t have to mess with external voters or manual overrides.

Option 2: The QDevice (The “Cheap” Vote)

If you can’t justify a third full-sized server, you use a Quorum Device (QDevice). A QDevice is a lightweight external voter. It doesn’t run VMs; it just tells the cluster “Yes, I see Node A.” You can run this on a Raspberry Pi, a tiny VM on a separate host, or even a cheap VPS.

To set up a QDevice on a separate Debian/Ubuntu machine:

# On the QDevice server (the voter)
apt update && apt install corosync-qnetd

# On all Proxmox nodes
apt update && apt install corosync-qdevice

Once the software is installed, you initialize the device from one of the Proxmox nodes. Note that the setup command connects to the QDevice server over SSH as root, so root SSH access from the PVE node to that machine has to work before you run it.

# Run this on one PVE node
pvecm qdevice setup <IP-OF-QDEVICE-SERVER>

This adds a third vote to the cluster without requiring a third Proxmox node. Now, if one PVE node fails, the other PVE node and the QDevice provide the two votes needed to maintain quorum.

Option 3: Monitoring and API Integration

If you’re running a larger setup, you shouldn’t be checking quorum by clicking through the GUI. I integrated pve_exporter with Prometheus to get alerts the second a node loses its vote.

Since I’m using token-based authentication to avoid the security risks of root passwords in plain text (see my post on Proxmox API Tokens), the setup looks like this.

First, create a restricted user for the exporter:

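# Create the user (the realm is already part of the user ID)
pveum user add prometheus@pve

# Grant the read-only PVEAuditor role on the whole resource tree
pveum acl modify / --users prometheus@pve --roles PVEAuditor

# Create API token for prometheus@pve; the secret is shown only once, so save it
pveum user token add prometheus@pve prometheus --privsep 0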

Then, configure the pve_exporter YAML:

default:
  user: prometheus@pve
  token_name: prometheus
  token_value: <TOKEN-SECRET>  # the secret printed at token creation, not the token ID

And the Prometheus scrape config to target the nodes:

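- job_name: 'proxmox'
  metrics_path: /pve
  scrape_interval: 30s
  params:
    cluster: ['1']
    node: ['1']
  relabel_configs:
    # Pass the node's IP (port stripped) to the exporter as the target parameter
    - source_labels: [__address__]
      regex: '(10\.0\.0\.\d+):\d+'
      target_label: __param_target
      replacement: $1
  static_configs:
    # One entry per node running pve_exporter; replace x with the real octet
    - targets: ['10.0.0.x:9221']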

Troubleshooting Quorum Failures

When quorum fails, the symptoms are immediate. You’ll see the “Cluster not ready” banner in the GUI, and if you try to edit a VM configuration, you’ll get a “Permission denied” or “Read-only file system” error. This is because /etc/pve is a FUSE-based filesystem (pmxcfs) that only allows writes while the node is part of a quorate partition.

To diagnose the current state, use the pvecm tool. It is the most direct way to see what the cluster actually thinks is happening.

# Check quorum status
pvecm status

The output will look something like this (abridged to the quorum-related sections):

Quorum information
------------------
Quorum provider:  corosync_votequorum
Nodes:            1
Quorate:          No

Votequorum information
----------------------
Expected votes:   2
Total votes:      1
Quorum:           2 Activity blocked

If Quorate: No is displayed, your node does not hold a majority. The Quorum: line is the number of votes required, not a yes/no flag. If you are in a situation where you know the other node is dead and you just need to get your services back online without adding a QDevice immediately, you can force the expected vote count.

# Force the cluster to accept 1 vote as a majority
pvecm expected 1

After running this, pvecm status should show Quorate: Yes. Be careful: only do this when you are sure the other node is really down and not just unreachable. If it is still running on the far side of a network partition while you’ve forced quorum, both nodes can act independently, and you are in a high-risk split-brain scenario. If you have shared storage, this is where corruption happens.
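If you want to check this from a script instead of reading the output by eye, a small filter is enough. This is a sketch under one assumption: that pvecm status prints a line starting with “Quorate:” followed by Yes or No, as Corosync’s votequorum output does. The function name is_quorate is hypothetical.

```shell
# Hypothetical helper: reads `pvecm status` output on stdin and succeeds
# only if the "Quorate:" field says Yes. A missing field counts as failure.
is_quorate() {
    awk '$1 == "Quorate:" { found = ($2 == "Yes") } END { exit (found ? 0 : 1) }'
}

# Intended use on a PVE node (assumes pvecm is in PATH):
#   if pvecm status | is_quorate; then echo "quorate"; else echo "NO QUORUM"; fi
```

I treat a missing field as “not quorate” on purpose, so a changed output format fails loudly rather than reporting a healthy cluster.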

Why it works

Proxmox uses Corosync for cluster membership and quorum. Corosync is designed for absolute consistency over availability (the “C” in the CAP theorem). It assumes that if you can’t reach a majority of your peers, you are the one who is isolated, not them.

In a two-node cluster, there is no way to distinguish between “Node B is dead” and “The network cable between Node A and Node B is unplugged.” If Node A decided to stay “active” while Node B also stayed “active,” and both tried to modify the same shared storage (like a Ceph pool or an NFS share), you’d end up with a corrupted filesystem.

By adding a third vote (either a node or a QDevice), you break the tie. The node that can still talk to the QDevice knows it is part of the majority. The node that is isolated knows it’s alone and gracefully steps back. The QDevice doesn’t need to be a powerful machine because it doesn’t participate in the actual data movement or VM execution; it’s just a witness. It’s a simple heartbeat mechanism that provides a tie-breaker.
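The tie-break can be sketched with the same majority check applied to each side of a cut link. This is a simplification with made-up names: side_is_quorate is not a real command, every voter is assumed to carry one vote, and it ignores how the QDevice decides which side to back.

```shell
# Hypothetical helper: does a partition that sees `seen` of `total` votes
# hold a strict majority?
side_is_quorate() {
    seen=$1
    total=$2
    if [ "$seen" -ge $(( total / 2 + 1 )) ]; then echo yes; else echo no; fi
}

# Two nodes, link cut: each side sees only its own vote.
echo "2 votes, side A: $(side_is_quorate 1 2), side B: $(side_is_quorate 1 2)"

# Two nodes plus a QDevice, link cut: the side that still reaches the
# QDevice sees 2 of 3 votes, the other side sees 1 of 3.
echo "3 votes, side A + QDevice: $(side_is_quorate 2 3), side B alone: $(side_is_quorate 1 3)"
```

With two votes, neither side qualifies, so everything freezes. With three, exactly one side can qualify, which is the whole point of the witness.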

Lessons learned

The biggest lesson here is that High Availability (HA) is a lie if you don’t have a proper quorum strategy. I spent a week thinking I had “HA” because I had two nodes and shared storage. In reality, I had a system where the loss of either node froze the whole cluster.

I also learned that network stability is the silent killer of quorum. I had a period where a faulty Cat6 cable caused intermittent packet loss. The cluster didn’t fully fail, but it would randomly lose quorum for 5-10 seconds, causing my API-driven automation to fail with 500 errors. If you’re running a production-grade homelab, don’t skimp on the networking. If you can, use a dedicated physical NIC for Corosync traffic to isolate it from VM traffic.

Finally, don’t trust the GUI for cluster health. The GUI is just a wrapper around the API. When the cluster is in a state of flux, the GUI can be misleading or completely unresponsive. Get comfortable with pvecm status and journalctl -u corosync.

If you’re building out more complex infrastructure, like Kubernetes on top of these nodes, remember that K8s has its own quorum logic via etcd. If your Proxmox nodes are unstable, your K8s control plane will follow suit. For those looking to scale these kinds of environments professionally, I provide infrastructure consulting to help avoid these exact architectural traps.

The takeaway is simple: two nodes is not a cluster; it’s a pair. A cluster starts at three votes. Whether those votes come from full servers or a tiny QDevice on a Raspberry Pi is up to your budget, but the math is non-negotiable.