Infrastructure/VM cluster

From Open Food Facts wiki

Revision as of 12:07, 16 January 2021

Cluster setup

Open Food Facts uses a Proxmox-based cluster to host virtual machines (VM) on servers provided by OVH.

The cluster is made of 4 physical machines ("nodes" or "hosts" in Proxmox jargon):

  • ovh1 and ovh2 are computation-oriented nodes: 24 cores, 256 GB RAM, 1 TB NVMe SSD
  • ovh3 and ovh4 are storage-oriented nodes: 32 GB RAM, 6 x 12 TB HDD + 512 GB NVMe cache

ovh1 and ovh3 are in the Roubaix datacenter, ovh2 and ovh4 in Strasbourg.

At initial setup (January 2021), Proxmox v6.3 was installed (based on Debian 10 "buster").

The Proxmox GUI is available on any of the cluster nodes on port 8006.

Cluster networking

At the networking level, a vRack links the cluster nodes with a 3 Gbps private network, used to access data on the storage servers and to replicate data between nodes. The MTU is set to 9600 on the private network to take advantage of the high bandwidth.

Cluster storage

All storage is managed using ZFS, which provides:

  • volume management (like LVM)
  • redundancy (like mdadm)
  • encryption (like LUKS)
  • compression
  • snapshots
  • quotas

Snapshots allow efficient synchronization between remote storage and are used extensively by Proxmox to replicate data across the nodes. They also simplify backups and allow rollbacks.

Virtualization / Containers

Proxmox allows full virtualization (VM, using QEMU) and containerization (CT, using LXC).

We use LXC-based containers (CT) to run our "virtual machines", as they have a much lower overhead than real VMs using QEMU (which are only required to run non-Linux operating systems). These containers can themselves contain containers, such as Docker, if needed.

All resources are shared and dynamically allocated, so they can be reallocated at any time without a reboot.

Containers are numbered (CTID) starting from 100.

Network allocation

To keep things simple, CT internal IP addresses are allocated using the rule CTID → 10.1.0.CTID: for example, container 100 gets 10.1.0.100.

Containers can reach other containers through their 10.1.0.x IP addresses.
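
The allocation rule can be written as a small shell helper. This is only a sketch: the function name ctid_to_ip is hypothetical, and the upper bound of 254 is an assumption (a /24-style network in which .255 is the broadcast address), not something stated on this page.

```shell
# Map a container ID to its internal address using the rule CTID -> 10.1.0.CTID.
# Assumption: valid CTIDs are 100..254 (numbering starts at 100; .255 is
# treated as a broadcast address and therefore excluded).
ctid_to_ip() {
    ctid=$1
    # Reject empty or non-numeric input.
    case $ctid in
        ''|*[!0-9]*) echo "invalid CTID: $ctid" >&2; return 1 ;;
    esac
    if [ "$ctid" -lt 100 ] || [ "$ctid" -gt 254 ]; then
        echo "CTID $ctid is outside the assumed 100..254 range" >&2
        return 1
    fi
    echo "10.1.0.$ctid"
}

ctid_to_ip 100   # prints 10.1.0.100
```

If the cluster ever grows past the assumed range, the rule itself would need to change (for example by using a second octet), so the helper fails loudly rather than guessing.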

Storage / replication / backups / retention

Container storage is managed using ZFS subvolumes. Space is allocated dynamically as a quota at the ZFS level, unlike partitions or disk images, so subvolumes do not need to be resized. The Proxmox GUI allows a quota to be increased; reducing one can be done directly with the ZFS CLI.

Container replication is done by Proxmox to copy a container's storage to one or more other nodes. Replication creates a temporary snapshot and sends only the difference from the previous snapshot, without scanning the filesystem itself as rsync does. A typical replication takes seconds or a few minutes, not hours.

Container backups are done on snapshots, which are tarred and compressed using zstd (the default, providing a high compression ratio with low CPU use). They are stored on a ZFS subvol named "backups", shared between nodes by NFS.

Proxmox allows backup retention management: it is possible to define how many daily, weekly and monthly backups to keep. An email is sent if a backup runs into problems.
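
To make the snapshot-based mechanism concrete, here is a rough sketch of the equivalent plain ZFS commands. It is not what Proxmox literally runs: the helper replication_cmds, the dataset name and the snapshot names are all hypothetical, and the function only prints the commands so they can be reviewed rather than executed.

```shell
# Print (do not run) the ZFS commands for one incremental replication round.
# Arguments (all hypothetical examples): dataset, previous snapshot name,
# new snapshot name, target node.
replication_cmds() {
    dataset=$1
    prev=$2
    curr=$3
    target=$4
    # 1. Take a new snapshot of the container's dataset.
    echo "zfs snapshot ${dataset}@${curr}"
    # 2. Send only the blocks changed between the two snapshots to the target.
    echo "zfs send -i ${dataset}@${prev} ${dataset}@${curr} | ssh ${target} zfs receive ${dataset}"
}

replication_cmds rpool/data/subvol-100-disk-0 rep1 rep2 ovh2
```

Because `zfs send -i` works from the snapshot metadata, the cost depends on how much data changed, not on how many files the container holds, which is why a round typically finishes in seconds.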

Migration / High-Availability

Proxmox handles container migration between nodes in the cluster. The container is stopped, its storage is replicated, then it is restarted on the new node. A previously replicated container can migrate from one node to another in a very short time.

The Proxmox High-Availability (HA) feature automatically migrates containers when the node hosting them is down. This is detected by the other nodes forming a "quorum" (a majority of nodes in the cluster considering that the node is down). In that case, the container is started from the last replicated storage. This means that the replication frequency of HA containers should be high (every few minutes), as changes made since the last replication will be lost in the migration.
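
As a worked example of the majority rule, the arithmetic below computes how many votes form a quorum and how many nodes can fail. It is a sketch that assumes one vote per node and no external tie-breaker device; the function names are hypothetical.

```shell
# Votes needed for quorum: a strict majority, i.e. floor(n / 2) + 1.
# Assumption: every node has exactly one vote.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

# Number of nodes that can be down while the rest still hold a majority.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

quorum_votes 4         # prints 3
tolerated_failures 4   # prints 1
```

Under these assumptions, a 4-node cluster needs 3 nodes to agree, so it can lose one node and still act; if the cluster split evenly in two, neither half would have a majority.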

Usage guidelines (to be completed)

Here are a few guidelines to follow for all new virtual machines:

  1. MUST: no direct root access on the nodes, even with an SSH key.
  2. MUST: sudoers (root access using sudo) limited to SSH-key-based authentication.
  3. SHOULD: use SSH keys published on GitHub; giving access to a server is then simple and secure:
    curl https://github.com/CharlesNepote.keys | tee -a ~/.ssh/authorized_keys
  4. SHOULD: take care of production resources: use "nice" / "ionice" for manually launched scripts. Stéphane's tip: just use
    nice ./mycommand whatever arguments
    (by default, nice lowers the priority). CPU and I/O priorities can also be set at the virtualization level if needed.