Proxmox VE Advanced: Cluster, HA, Ceph & API

When your environment grows from one machine to multiple machines, operations shift from "as long as it runs" to "availability, scalability, and automation." This article takes you from soloing a dungeon to raiding as a group.

Building a Cluster

Use case: for centralized management, cross-node scheduling, and migration — start with a cluster.

# Create a cluster on the first node
pvecm create my-cluster
 
# Join other nodes: run this on each new node, pointing at any existing cluster member
pvecm add <existing-node-ip>
 
# Check status
pvecm status

Before joining, make sure DNS resolution, time synchronization, and the cluster network all work on every node; otherwise you will hit hard-to-diagnose issues. Once a node has joined, avoid renaming its hostname.
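
After a join, it helps to confirm that the cluster actually has quorum. The following is a minimal sketch, assuming that `pvecm status` prints a line of the form `Quorate:   Yes`; the function name `is_quorate` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a cluster is quorate from `pvecm status` text.
# Assumption: the output contains a line such as "Quorate:          Yes".

# Reads `pvecm status` output on stdin; succeeds only if it reports quorum.
is_quorate() {
  awk '$1 == "Quorate:" { found = ($2 == "Yes") } END { exit !found }'
}

# Typical use on a cluster node (not executed here):
#   pvecm status | is_quorate && echo "cluster is quorate"
```

Because the function only reads text from stdin, it can be tried against saved output before being wired into a monitoring script.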

HA (High Availability) Configuration

Use case: if a node goes down, its VMs are automatically restarted on another node, so the outage is limited to fencing plus boot time instead of lasting until someone intervenes.

Prerequisites:

  • An existing cluster, ideally with at least three nodes so it keeps quorum when one fails
  • VM disks on shared storage
  • A proper fencing mechanism to avoid split-brain (two parts of the cluster both believing they own the same VM)

# Put VM 100 under HA management
ha-manager add vm:100 --max_restart 3 --max_relocate 5

When HA is configured correctly, a single-node failure shows up to users as a short blip, not a long outage, like a cat always having a backup sleeping spot.
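
The split-brain concern comes down to majority voting: only the side of a partition that holds more than half of the votes may keep running HA resources. As a rough sketch, assuming every node carries one vote (the default), the arithmetic looks like this; the function names are invented for illustration.

```shell
#!/usr/bin/env bash
# Sketch: quorum arithmetic for a cluster where each node has one vote.

# Votes needed for quorum with N total votes: floor(N / 2) + 1.
quorum_votes() {
  echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can fail while the rest still hold quorum.
tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}
```

With these definitions, a two-node cluster tolerates no failures, while three nodes tolerate one, which is why three is the usual minimum for HA.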

Ceph Hyper-Converged Storage

Use case: multiple nodes need shared storage with high availability — Ceph can handle both.

Recommended prerequisites:

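  • At least 3 nodes (Ceph doesn't like single points of failure)
  • Multiple dedicated disks per node
  • A dedicated storage network is even better

# Install the Ceph packages (run on every node)
pveceph install

# Initialize the Ceph configuration (run once, on the first node)
pveceph init --network 10.0.0.0/24

# Create a monitor (repeat on at least three nodes so the monitors keep quorum)
pveceph mon create

# Create an OSD for each data disk
pveceph osd create /dev/sdX

# Create a pool and register it as PVE storage
pveceph pool create vm-data --add_storages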

Ceph works best when it manages raw disks directly, so don't put hardware RAID underneath it. Doing so weakens both the data protection and the observability that Ceph provides, because Ceph can no longer see or replace individual disks.
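
Since every data disk needs its own OSD, a small dry-run helper can make the step less error-prone. This is only a sketch: `plan_osd_commands` is a made-up name, and it prints the commands instead of running them so the device list can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: print (not run) one `pveceph osd create` command per device.
# The device paths passed in are whatever the operator chooses; nothing
# here checks that they are empty or unused.

plan_osd_commands() {
  local dev
  for dev in "$@"; do
    printf 'pveceph osd create %s\n' "$dev"
  done
}

# Example (device names are placeholders):
#   plan_osd_commands /dev/sdb /dev/sdc
```

Once the printed list looks right, the commands can be run by hand or piped to a shell on the node that owns those disks.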

Storage Replication

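Use case: asynchronously copy a guest's ZFS-backed disks to another node on a schedule, so a recent copy is already in place when you need to fail over. Changes made since the last run are not on the target yet.

# Job ID is <vmid>-<jobnum>; the target is a node name; this job runs daily at 02:00
pvesr create-local-job 100-0 <target-node> --schedule "02:00"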
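
As a sketch of how replication job commands are composed: `pvesr` identifies each job by an ID of the form `<vmid>-<jobnum>` and takes the name of the target node, with the schedule given as a calendar-event string such as `*/15` (every 15 minutes). The helper below only prints the command; its name and the node name `pve2` in the example are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: print the pvesr command for one replication job.
# Args: VM ID, job number, target node name, schedule string.

replication_job_cmd() {
  printf 'pvesr create-local-job %s-%s %s --schedule "%s"\n' "$1" "$2" "$3" "$4"
}

# Example (pve2 is a placeholder node name):
#   replication_job_cmd 100 0 pve2 '*/15'
```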

API and Automation Integration

Use case: operate via scripts, CI/CD, or integrate with other platforms. Authenticate with API tokens, which can be scoped and revoked individually, instead of spreading the root password around.

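# List VMs using an API token
# (-k skips TLS verification; drop it once the node has a trusted certificate)
# Single quotes keep an interactive shell from treating "!" as history expansion
curl -k \
  -H 'Authorization: PVEAPIToken=user@realm!tokenid=secret' \
  "https://<PVE-IP>:8006/api2/json/nodes/<node>/qemu"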

Once automation is set up, repetitive work goes to scripts — you handle the coffee and the cat.
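
For scripts that make more than one call, the token header can be built once and reused. The following is a minimal sketch and not an official client: the function names and the environment variables `PVE_HOST`, `PVE_USER`, `PVE_TOKEN_ID`, and `PVE_TOKEN_SECRET` are assumptions made for this example, and it expects the node to present a certificate that curl trusts.

```shell
#!/usr/bin/env bash
# Sketch: small helpers for calling the Proxmox VE API with a token.

# Builds the Authorization header. Args: user@realm, token ID, secret.
pve_auth_header() {
  printf 'Authorization: PVEAPIToken=%s!%s=%s\n' "$1" "$2" "$3"
}

# GETs an API path (for example /nodes/<node>/qemu) and prints the raw JSON.
# PVE_HOST, PVE_USER, PVE_TOKEN_ID and PVE_TOKEN_SECRET are hypothetical
# environment variable names chosen for this sketch.
pve_get() {
  curl --silent --show-error --fail \
    -H "$(pve_auth_header "$PVE_USER" "$PVE_TOKEN_ID" "$PVE_TOKEN_SECRET")" \
    "https://${PVE_HOST}:8006/api2/json$1"
}

# Example (not executed here):
#   pve_get "/nodes/<node>/qemu"
```

Keeping the secret in an environment variable or a secrets store, and out of the script itself, makes it easier to rotate the token later.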

Next Steps

With advanced capabilities mastered, use a set of operations guidelines to reduce risk: 👉 Best Practices