Objective of this article
Test the new features of Proxmox VE 6 and create a 3-node cluster with Ceph directly from the graphical interface
Proxmox VE version 6.0-5
Ceph version 14.2.1 Nautilus (stable)
3 A3Server machines, each equipped with 2 SSDs (one 480 GB and one 512 GB, intentionally different), one 2 TB HDD and 16 GB of RAM.
RAID type: ZFS RAID 0 (on the HDD)
SSD disks (sda, sdb) for Ceph
We named the nodes PVE1, PVE2 and PVE3.
Before starting, we created a 3-node Proxmox VE cluster from the graphical interface, although the same can be done from the command line. If you prefer the command line and have not already done so, you can create the cluster by following our guide.
In the following paragraphs we show how to create a cluster from the GUI, how to install the Ceph packages and how to carry out the initial configuration.
Download the test environment
To help you explore the potential of a Proxmox VE cluster, we have built a lab for testing the possible Ceph configurations.
The lab consists of 3 Proxmox VE virtual machines already configured as a cluster with Ceph.
Below, you will find the link to download the test environment.
Go to Datacenter -> Cluster and click the Create Cluster button.
Give the cluster a name, choose the dedicated interface, and click Create.
In a production environment it is strongly recommended, although not mandatory, to separate the cluster interface from the other interfaces (especially Ceph’s).
Ideally you have one interface for the cluster, one for Ceph and one for GUI administration, all separate from those dedicated to VMs and/or containers, to keep everything clean and avoid performance problems.
At the end of the procedure you will see a window similar to the one below. The cluster is now created; all that remains is to add the other nodes.
Now click on the Join Information button and then on the Copy Information button.
Move to each node you want to add and, from the same path (Datacenter -> Cluster), click Join Cluster and paste the copied information, filling in any missing values.
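For readers who prefer the shell, the same create-and-join flow can be sketched with pvecm, the Proxmox VE cluster manager. This is a minimal sketch and not the procedure we followed in the lab: the function names, the cluster name and the IP address in the example are placeholders.

```shell
# Run once, on the first node (PVE1 in this lab). $1 = cluster name.
create_cluster() {
    pvecm create "$1"
}

# Run on every other node (PVE2, PVE3). $1 = IP of a node that is
# already a cluster member; pvecm asks for that node's root password.
join_cluster() {
    pvecm add "$1"
}

# Run on any node to check quorum and membership.
check_cluster() {
    pvecm status && pvecm nodes
}

# Example (placeholder values):
#   create_cluster lab-cluster      # on PVE1
#   join_cluster 192.168.1.11       # on PVE2 and PVE3
#   check_cluster
```

Each function is meant for a specific node, so the file is intended to be sourced and the functions called by hand rather than run as one script.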
We have also updated the packages, again from the graphical interface, and then – as always – we have installed some basic Debian packages that are useful for troubleshooting.
# apt install htop iotop
Ceph: installation and setup
We also used the graphical interface to install Ceph. Select each node of the cluster in turn, go to Ceph and click the Install Ceph-nautilus button.
Click the Start Installation button.
Then type Y and press Enter.
This article is not an in-depth look at Ceph (for details on the configuration parameters, see the official documentation linked at the bottom of this article), but we will briefly introduce the parameters on the page shown below, which appears during the installation procedure.
Public Network: the network used for Ceph traffic; this setting is mandatory. Separating Ceph traffic from the rest is highly recommended, because otherwise it can interfere with other latency-dependent services such as cluster communication, and those services can in turn reduce Ceph’s performance.
Cluster Network: optionally, you can also separate the OSD replication and heartbeat traffic. This lightens the load on the Public Network and can lead to significant performance improvements, especially in large clusters.
Number of replicas: defines how many times an object is replicated.
Minimum replicas: defines the minimum number of replicas required for an I/O operation to be marked as complete.
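To make the effect of the number of replicas concrete: every object is stored that many times, so the usable space of a replicated pool is roughly the raw capacity divided by the number of replicas. The sketch below applies that rule of thumb to the disks of this lab. The function name is ours, the figures assume 3 replicas, and the result ignores Ceph's own overhead and the safety margin it keeps before a pool is considered full.

```shell
# usable_gb RAW_GB REPLICAS
# Rough usable capacity of a replicated pool: raw capacity / replicas.
usable_gb() {
    awk -v raw="$1" -v size="$2" 'BEGIN { printf "%.0f\n", raw / size }'
}

# Three 480 GB OSDs (one per node), 3 replicas:
usable_gb 1440 3    # prints 480
# After adding the three 512 GB SSDs (3 x 480 + 3 x 512 = 2976 GB raw):
usable_gb 2976 3    # prints 992
```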
In this lab, and for the purpose of this article, the Ceph network is not separate from the rest!
You must also choose the first monitor node.
If all went well you should see a success page, like the one in the figure above, with further instructions on how to proceed. You are now ready to start using Ceph, but first you need to create additional Monitors, some OSDs and at least one Pool (as the page itself explains).
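The wizard steps have command-line counterparts in pveceph. The following is a sketch of that path, assuming the subcommand names used by Proxmox VE 6 (mon create, pool create); the function names, the subnets and the pool name in the example are placeholders, and in the lab we used the GUI throughout.

```shell
# On every node: install the Ceph packages.
install_ceph() {
    pveceph install
}

# On the first node only: write the initial Ceph configuration.
# $1 = public network (CIDR), $2 = optional cluster network (CIDR).
init_ceph() {
    if [ -n "${2:-}" ]; then
        pveceph init --network "$1" --cluster-network "$2"
    else
        pveceph init --network "$1"
    fi
}

# On each node that should run a monitor.
add_monitor() {
    pveceph mon create
}

# On any node, once OSDs exist.
# $1 = pool name, $2 = number of replicas, $3 = minimum replicas.
create_pool() {
    pveceph pool create "$1" --size "$2" --min_size "$3"
}

# Example (placeholder values):
#   install_ceph
#   init_ceph 10.10.10.0/24
#   add_monitor
#   create_pool testpool 3 2
```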
The status page shows at a glance, through colors and icons, whether everything is working. In the following image green indicates a healthy state, and if you look at the OSDs column you will notice that there are no disks (OSDs) yet. In the next step we see how to create an OSD from a disk.
Ceph: OSD creation
Select a cluster node, then Ceph, then OSD. Click Create: OSD and the window below appears, where you can add the disks you want and set some parameters.
In this lab we initially added only the first SSD (480 GB) of each server to Ceph, in order to simulate a production environment that later needs more storage space.
Below is the final result.
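From the shell, an OSD can be created with pveceph osd create. The sketch below wraps it in a small guard, since creating an OSD is destructive for whatever is on the disk. The function name and the guard are our own additions, and the device path in the example is only illustrative (sda and sdb are the two SSDs in this lab).

```shell
# create_osd DEVICE
# Turn a whole, unused disk into a Ceph OSD. Refuses anything that is
# not a block device before calling pveceph.
create_osd() {
    if [ ! -b "$1" ]; then
        echo "not a block device: $1" >&2
        return 1
    fi
    pveceph osd create "$1"
}

# Example, run on each node in turn:
#   create_osd /dev/sda
```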
Ceph: disk considerations
Ceph works best with evenly sized disks distributed evenly across the nodes. For example, four 500 GB disks in each node are better than a mixed configuration with a single 1 TB disk and three 250 GB disks.
When sizing a Ceph cluster, it is important to consider recovery times (especially with small clusters). To keep these times short, Proxmox recommends using SSDs instead of HDDs in small configurations.
In general, SSDs provide more IOPS than classic spinning disks, but given their higher cost compared to HDDs, it can be worthwhile to create separate pools based on device class (disk type).
A short note for those who love the command line: a quick way to check the class of each disk is the command # ceph osd tree. Its output, similar to the one shown in the following image, lists the essential information on the OSDs, including the CLASS column that identifies the disk type (ours have the value ssd).
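Building on the device class, a pool can be restricted to one disk type through a CRUSH rule. The sketch below shows one way to do it with the standard ceph commands. We did not do this in the lab (all our OSDs are ssd); the function names and the rule naming scheme are ours, and the rule assumes the default CRUSH root with host as the failure domain.

```shell
# Show the OSD hierarchy, including the CLASS column.
show_osd_classes() {
    ceph osd tree
}

# pin_pool_to_class POOL CLASS
# Create a replicated CRUSH rule limited to one device class
# (hdd, ssd or nvme) and assign it to POOL.
pin_pool_to_class() {
    pool="$1"
    class="$2"
    rule="replicated_${class}"
    ceph osd crush rule create-replicated "$rule" default host "$class" &&
        ceph osd pool set "$pool" crush_rule "$rule"
}

# Example (placeholder pool name):
#   pin_pool_to_class testpool ssd
```

Changing the rule of a pool that already holds data makes Ceph move that data to the matching OSDs, so expect rebalancing traffic.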
Proxmox VE supports one configuration to speed up OSDs in a “mixed” HDD + SSD environment: using a faster disk as the journal or DB / Write-Ahead-Log (WAL) device. These parameters are visible in the previous image in Ceph: OSD creation.
Keep in mind that if one fast disk serves several OSDs, you need the right ratio between OSD disks and the WAL / DB (or journal) disk, otherwise the fast disk risks becoming the bottleneck for all the OSDs attached to it.
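As an illustration of the shared fast disk, the sketch below creates several OSDs that all place their DB on the same device, using the db_dev option of pveceph osd create. It is a hypothetical layout: our lab has no such mixed setup, the function name is ours and the device paths in the example are placeholders.

```shell
# create_osds_with_shared_db DB_DEVICE DATA_DEVICE...
# Create one OSD per data disk, each with its DB on DB_DEVICE.
# Stops at the first failure.
create_osds_with_shared_db() {
    db_dev="$1"
    shift
    for data_dev in "$@"; do
        pveceph osd create "$data_dev" --db_dev "$db_dev" || return 1
    done
}

# Example: two HDDs sharing one SSD for the DB (placeholder paths):
#   create_osds_with_shared_db /dev/sdc /dev/sdd /dev/sde
```

Every data disk added to the list increases the load on the single DB device, which is exactly the ratio to keep an eye on.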
It is also necessary to balance the number of OSDs against their individual capacity. Larger disks increase storage density, but they also mean that a single OSD failure forces Ceph to recover more data at once.
Ceph: advantages of using it with Proxmox VE
Ceph is a distributed object store and a file system designed to provide excellent performance, reliability and scalability.
Its RADOS Block Devices (RBD) implement feature-rich block-level storage; using it with Proxmox VE you get the following advantages:
- Easy configuration and management with CLI and GUI support
- Thin provisioning
- Resizable volumes
- Distributed and redundant (striped across multiple OSDs)
- Support for snapshots
- Self healing: when problems occur, automatic procedures try to resolve them.
- No single point of failure
- Scalable to exabyte level
- Configuration of multiple Pools with different redundancy and performance characteristics
- The data is replicated, making it fault tolerant
- Works with inexpensive hardware
- No hardware RAID controller is required
- Open source
Recent hardware has, on average, powerful CPUs and a fair amount of RAM, so it is possible to run Ceph services directly on the Proxmox VE nodes, with storage services and VMs on the same node. This type of configuration is suitable for small and medium-sized clusters and is the subject of this lab and article.
Ceph: simulating an increase in storage space
As mentioned earlier, in the lab we deliberately used only one SSD per node as a Ceph OSD at first, to see what happens when we need to scale.
We then added 3 more SSDs, one per node, with a slightly different capacity (512 GB) from those already in use, and turned them into OSDs one by one.
In this image you can see the summary screen of the disks inside the node (in our case it is the same for each node).
Creating a new OSD for the existing Pool works exactly as before (you can also create a new Pool based on the disk class, as mentioned above). In our case we left all the parameters at their defaults.
Every time we add a disk, the status page shows the operations Ceph performs to make the disk usable in the pool (or pools).
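The same progress can be followed from the shell. The sketch below polls ceph health and prints the cluster status until Ceph reports HEALTH_OK again. The function name and the polling interval are our own choices; in the lab we simply watched the GUI.

```shell
# wait_for_health_ok [SECONDS]
# Print the cluster status every SECONDS (default 10) until the first
# word of "ceph health" is HEALTH_OK.
wait_for_health_ok() {
    interval="${1:-10}"
    until [ "$(ceph health 2>/dev/null | cut -d' ' -f1)" = "HEALTH_OK" ]; do
        ceph -s
        sleep "$interval"
    done
}

# Example, after creating a new OSD:
#   wait_for_health_ok 30
```

The loop has no timeout, so interrupt it with Ctrl+C if the cluster stays in a warning state for reasons unrelated to the new disk.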
When the procedure ends correctly, you will see something similar to the figure below, with the number of OSDs “In” increased:
We observed no downtime or malfunctions of the VMs or containers during the expansion of the pool or the creation of the new OSDs.
In the first part of the lab, with only 1 SSD per node as a Ceph OSD, we created a Debian VM running a 10-hour stress test and 1 Ubuntu container.
This cluster, with only 16 GB of RAM and small SSDs, was not designed for production, but we should point out that despite its sizing it always behaved well.
After the stress test, which saturated the memory, the nodes no longer saw each other in the GUI (see image below), but the cluster and the VMs / containers continued to work perfectly.
The problem was caused by a malfunction of the Corosync service that did not affect the guests. To fix it, we restarted the service on all three nodes:
# systemctl restart corosync.service
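Before or after restarting, it can help to look at what Corosync itself reports. The sketch below gathers the service state, the recent log lines and the quorum view on one node; it is a generic check we suggest, not a record of the diagnosis we performed, and the function name is ours.

```shell
# check_corosync
# Show the service state, the last 50 log lines since boot, and the
# cluster membership as seen from this node.
check_corosync() {
    systemctl --no-pager status corosync.service
    journalctl -u corosync.service -b --no-pager -n 50
    pvecm status
}

# Example, on each of the three nodes:
#   check_corosync
```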
Simulation of a disk fault
We disconnected the SSD on PVE2 (the node hosting the only VM) to simulate a disk fault. The VM still starts and works (see image below, VM100) thanks to the Ceph features listed among the advantages above.
On the test pool (the Ceph one), the available space increased from 486 GB to 650 GB. Put simply, and without going into too much detail, with 1 disk removed Ceph has less data to replicate, which leaves more free space on each remaining disk.
In the Ceph status page, in the Performance -> Usage box, the value instead decreased from 1.31 TB to 894 GB (this is the sum of the sizes of the disks that we turned into OSDs to create the pool).
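One way to read these two values is as the raw size of three 480 GB OSDs before the fault and of two afterwards, expressed in binary units. The sketch below checks that arithmetic. It assumes that a "480 GB" SSD holds 480 x 10^9 bytes and that the Usage box reports binary units; both are assumptions on our part, and the helper names are ours.

```shell
# Convert a byte count to TiB (two decimals) or GiB (rounded).
to_tib() { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1024 ^ 4 }'; }
to_gib() { awk -v b="$1" 'BEGIN { printf "%.0f\n", b / 1024 ^ 3 }'; }

SSD_BYTES=480000000000        # assumption: 480 GB = 480 * 10^9 bytes

to_tib $((3 * SSD_BYTES))     # three OSDs, before the fault: prints 1.31
to_gib $((2 * SSD_BYTES))     # two OSDs, after the fault: prints 894
```

Under those assumptions the numbers line up with the 1.31 TB and 894 GB shown in the status page.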
After quickly configuring the HA settings, we simulated the failure of a node. The container used for this test was moved to the node with the configured priority, and then moved back to its original node once that node came back online.
The following images briefly show how we configured the Group. If you do not know how to configure HA, refer to our guide.
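For reference, a group like this can also be defined with ha-manager. The sketch below is a hypothetical equivalent of the GUI steps: the function names, the group name, the priorities and the container ID in the example are placeholders, not the values from our screenshots.

```shell
# create_ha_group GROUP NODES
# NODES is a comma-separated list with optional priorities, for example
# "PVE1:2,PVE2:1,PVE3:1"; a higher number means a higher priority.
create_ha_group() {
    ha-manager groupadd "$1" --nodes "$2"
}

# protect_guest SID GROUP
# SID is "ct:<id>" for a container or "vm:<id>" for a VM.
protect_guest() {
    ha-manager add "$1" --group "$2"
}

# Example (placeholder values):
#   create_ha_group lab-group "PVE1:2,PVE2:1,PVE3:1"
#   protect_guest ct:101 lab-group
#   ha-manager status
```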
A small sore point: unfortunately you cannot perform a storage migration (Move Volume) of a running container. We wanted to move the test container from the local-zfs storage to Ceph, and here is the result:
Ceph appears to be a stable and very intuitive product, provided you study the values to set during configuration (see the official documentation linked below).
It offers the advantages listed above; however, you need to take into account the space that is “lost” to replication and to design the storage carefully (even externally), perhaps using a mix of HDDs and SSDs with the “caching” mechanisms (journal or DB / WAL on a faster disk) described previously.
Official Proxmox VE documentation.