Proxmox VE Advanced: Cluster, HA, Ceph & API


When your environment grows from one machine to multiple machines, operations shift from "as long as it runs" to "availability, scalability, and automation." This article takes you from soloing a dungeon to raiding as a group.

Building a Cluster

Use case: centralized management, cross-node scheduling, and VM migration — start with a cluster.

# Create a cluster on the first node
pvecm create my-cluster
 
# Join from each additional node (point it at the IP of any existing cluster node)
pvecm add <existing-node-ip>
 
# Check status
pvecm status

Once the cluster is set up, avoid renaming node hostnames. Before joining a node, make sure DNS resolution, time synchronization, and the network are all working — otherwise you'll chase strange issues. Also note that a two-node cluster cannot keep quorum after a single failure; add a third node or a QDevice.
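Quorum is why node count matters: the cluster only makes changes while a strict majority of votes is reachable. A quick sketch of the arithmetic (illustrative only, not Proxmox code):

```python
def votes_needed(total_votes: int) -> int:
    """Minimum votes for quorum: a strict majority."""
    return total_votes // 2 + 1

def survives_one_failure(nodes: int) -> bool:
    """Can the cluster keep quorum after losing one node (1 vote each)?"""
    return nodes - 1 >= votes_needed(nodes)

# A 2-node cluster loses quorum when either node fails;
# 3 nodes (or 2 nodes plus a QDevice vote) survive one failure.
```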

HA (High Availability) Configuration

Use case: when a node fails, its VMs are automatically restarted on another node. Note that HA minimizes downtime but does not eliminate it: failover means a fresh boot on a surviving node, so expect a brief interruption.

Prerequisites:

  • An existing cluster
  • VM disks on shared storage
  • Proper fencing mechanism to avoid split-brain (two nodes both thinking they're the leader)

# Put VM 100 under HA management
ha-manager add vm:100 --max_restart 3 --max_relocate 5

When HA is configured correctly, users are almost unaware of single-node failures — like a cat always having a backup sleeping spot.
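The restart/relocate limits above bound how hard HA tries before giving up: retry the start on the current node up to the restart limit, then relocate the VM to another node, up to the relocate limit, and finally mark it as errored. A toy model of that escalation (an illustrative sketch, not the actual ha-manager implementation):

```python
def ha_recover(restart_ok, max_restart=3, max_relocate=5):
    """Toy model of HA escalation: local restarts first, then relocations.
    restart_ok: function(attempt) -> bool, True if that start attempt succeeds.
    Returns 'started', or 'error' once both budgets are exhausted."""
    relocations = 0
    while True:
        for attempt in range(max_restart + 1):  # initial start + retries
            if restart_ok(attempt):
                return "started"
        if relocations >= max_relocate:
            return "error"  # ha-manager would flag the service as errored
        relocations += 1  # move to another node; restart budget resets there
```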

Ceph Hyper-Converged Storage

Use case: multiple nodes need shared storage with high availability — Ceph can handle both.

Recommended prerequisites:

  • At least 3 nodes (Ceph doesn't like single points)
  • Multiple dedicated disks per node
  • A dedicated storage network is even better

# Install the Ceph packages, then initialize (a dedicated storage network is recommended)
pveceph install
pveceph init --network 10.0.0.0/24
 
# Create monitors (run on at least 3 nodes)
pveceph mon create
 
# Create an OSD for each data disk
pveceph osd create /dev/sdX
 
# Create a Pool and attach it to PVE
pveceph pool create vm-data --add_storages

Ceph works best with raw disks directly — don't layer hardware RAID on top. Doing so weakens both the data protection and observability that Ceph provides.
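With the default 3-way replication, raw capacity shrinks quickly, so budget disks accordingly. A back-of-the-envelope helper (illustrative only; assumes the default size=3 replicated pool and an arbitrary 80% fill headroom, since Ceph warns as OSDs approach full):

```python
def usable_tb(raw_tb: float, replicas: int = 3, fill_ratio: float = 0.8) -> float:
    """Rough usable capacity for a replicated Ceph pool.
    raw_tb: total raw disk capacity across all OSDs.
    fill_ratio: keep headroom below the full/near-full thresholds."""
    return raw_tb / replicas * fill_ratio

# e.g. 3 nodes x 4 x 2 TB disks = 24 TB raw -> roughly 6.4 TB of comfortable usable space
```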

Storage Replication

Use case: cross-node redundancy for fast failover.

# Replicate VM 100 to a target node nightly at 02:00 (requires ZFS storage on both nodes)
# Job IDs take the form <vmid>-<number>; the schedule uses PVE calendar-event syntax, not cron
pvesr create-local-job 100-0 <target-node> --schedule "02:00"
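Storage replication is asynchronous: the schedule sets your worst-case data loss window (RPO), because anything written since the last successful run is lost on failover. A small helper to reason about that trade-off (illustrative sketch, not PVE code):

```python
from datetime import datetime, timedelta

def replica_staleness(last_sync: datetime, failover_at: datetime) -> timedelta:
    """Writes made after the last successful sync are lost if we fail over now."""
    return failover_at - last_sync

def max_rpo(schedule_interval: timedelta) -> timedelta:
    """Worst case: the node dies just before the next scheduled run
    (transfer time ignored for simplicity)."""
    return schedule_interval

# A nightly job means up to ~24h of lost writes; replicate more often for a tighter RPO.
```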

API and Automation Integration

Use case: drive PVE from scripts or CI/CD, or integrate it with other platforms. Use API tokens instead of scattering the root password everywhere.

# List VMs using an API token
curl -k \
  -H "Authorization: PVEAPIToken=user@realm!tokenid=secret" \
  "https://<PVE-IP>:8006/api2/json/nodes/<node>/qemu"
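The same call works from Python with only the standard library. A sketch under stated assumptions: the host, node name, and token values below are placeholders you must replace, and the unverified-TLS context mirrors curl's -k, so it is for lab setups only:

```python
import json
import ssl
import urllib.request

def pve_token_header(token_id: str, secret: str) -> dict:
    """Proxmox API token header: PVEAPIToken=<user>@<realm>!<tokenid>=<secret>."""
    return {"Authorization": f"PVEAPIToken={token_id}={secret}"}

def list_vms(host: str, node: str, token_id: str, secret: str) -> list:
    ctx = ssl._create_unverified_context()  # lab only: skips cert checks, like curl -k
    req = urllib.request.Request(
        f"https://{host}:8006/api2/json/nodes/{node}/qemu",
        headers=pve_token_header(token_id, secret),
    )
    with urllib.request.urlopen(req, context=ctx) as resp:
        return json.load(resp)["data"]

# Example with placeholder values:
# for vm in list_vms("192.0.2.10", "pve1", "automation@pve!ci", "<secret>"):
#     print(vm["vmid"], vm["name"], vm["status"])
```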

Once automation is set up, repetitive work goes to scripts — you handle the coffee and the cat.

Next Steps

With advanced capabilities mastered, use a set of operations guidelines to reduce risk: 👉 Best Practices