Achieving high density deployments with NUMA

There are many factors that can affect the performance of Virtual Machines (VMs) running on host hardware. One of these is how the VM interacts with NUMA (non-uniform memory access).

This section provides an overview of NUMA and how it applies to Pexip Infinity Conferencing Nodes. It summarizes our recommendations and suggests best practices for maximizing performance.

About NUMA

NUMA is an architecture that divides the computer into a number of nodes, each containing one or more processor cores and associated memory. A core can access its local memory faster than it can access the rest of the memory on that machine. In other words, it can access memory allocated to its own NUMA node faster than it can access memory allocated to another NUMA node on the same machine.

The diagram (right) outlines the physical components of a host server and shows the relationship to each NUMA node.
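For reference, on a Linux host (for example a KVM-based hypervisor, or any machine you can shell into) this topology can be read directly from sysfs. The following minimal Python sketch lists each NUMA node and the CPUs local to it; the sysfs paths are standard Linux locations, and the output naturally depends on your hardware.

```python
#!/usr/bin/env python3
"""List the NUMA nodes of a Linux host and the CPUs local to each node."""
from pathlib import Path

NODE_ROOT = Path("/sys/devices/system/node")

def numa_topology() -> dict[int, str]:
    """Map each NUMA node ID to its CPU list string (e.g. "0-17,36-53")."""
    topology = {}
    for node_dir in sorted(NODE_ROOT.glob("node[0-9]*")):
        node_id = int(node_dir.name[len("node"):])
        cpulist = (node_dir / "cpulist").read_text().strip()
        topology[node_id] = cpulist
    return topology

if __name__ == "__main__":
    for node, cpus in sorted(numa_topology().items()):
        print(f"NUMA node {node}: CPUs {cpus}")
```

Hosts with the numactl package installed can obtain the same information with numactl --hardware.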

Conferencing Nodes and NUMA nodes

We strongly recommend that a Pexip Infinity Conferencing Node VM is deployed on a single NUMA node to avoid the loss of performance incurred when a core accesses memory outside its own node.

In practice, with modern servers, each socket represents a NUMA node. We therefore recommend that:

  • one Pexip Infinity Conferencing Node VM is deployed per socket of the host server
  • the number of vCPUs that the Conferencing Node VM is configured to use is the same as or less than the number of physical cores available in that socket (unless you are taking advantage of hyperthreading to deploy one vCPU per logical thread — in which case see NUMA affinity and hyperthreading).

This second diagram shows how the components of a Conferencing Node virtual machine relate to the server components and NUMA nodes.

You can deploy smaller Conferencing Nodes over fewer cores/threads than are available in a single socket, but this will reduce capacity.

Deploying a Conferencing Node over more cores (or threads when pinned) than a single socket provides will cause a loss of performance whenever remote memory is accessed. This must be taken into account when moving Conferencing Node VMs between host servers with different hardware configurations: if an existing VM is moved to a socket that contains fewer cores/threads than the VM is configured to use, the VM will end up spanning two sockets, and therefore two NUMA nodes, impacting performance.

To prevent this occurring, ensure that either:

  • you only deploy Conferencing Nodes on servers with a large number of cores per processor
  • the number of vCPUs used by each Conferencing Node is the same as (or less than) the number of cores/threads available on each NUMA node of even your smallest hosts (a minimal sizing check is sketched below).
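To illustrate this sizing rule, here is a minimal Python sketch (a hypothetical helper, not part of Pexip Infinity) that checks a planned vCPU count against the per-socket core count of every host the VM might run on, and flags any host where the Conferencing Node would end up spanning NUMA nodes.

```python
def fits_on_one_numa_node(vcpus: int, cores_per_socket: int,
                          hyperthreading_pinned: bool = False) -> bool:
    """Return True if a Conferencing Node with `vcpus` vCPU fits in one NUMA node.

    cores_per_socket: physical cores per socket (i.e. per NUMA node) on the host.
    hyperthreading_pinned: set True only when deploying one vCPU per logical
    thread AND pinning the VM to the socket (see NUMA affinity and hyperthreading).
    """
    capacity = cores_per_socket * 2 if hyperthreading_pinned else cores_per_socket
    return vcpus <= capacity

# Example: check a 28 vCPU Conferencing Node against two candidate hosts.
hosts = {"host-a": 18, "host-b": 12}  # physical cores per socket (illustrative)
for name, cores in hosts.items():
    ok = fits_on_one_numa_node(28, cores, hyperthreading_pinned=True)
    print(f"{name}: {'OK' if ok else 'would span two NUMA nodes'}")
```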

NUMA affinity and hyperthreading

You can utilize the logical threads of a socket (hyperthreading) to deploy a Conferencing Node VM with two vCPUs per physical core (i.e. one per logical thread) to achieve up to 50% additional capacity.

However, if you do this you must ensure that all Conferencing Node VMs are pinned to their respective sockets within the hypervisor (also known as NUMA affinity). Otherwise, the Conferencing Node VMs will end up spanning multiple NUMA nodes, resulting in a loss of performance.
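As an illustration, the following Python sketch (assuming a Linux host; on ESXi or Hyper-V the affinity itself is configured through the hypervisor, as covered in the step-by-step guides below) groups logical CPUs by physical socket, which is the mapping you need when deciding which socket each Conferencing Node VM should be pinned to.

```python
#!/usr/bin/env python3
"""Group the logical CPUs (threads) of a Linux host by physical socket."""
from collections import defaultdict
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")

def threads_per_socket() -> dict[int, list[int]]:
    """Map each physical socket (package) ID to its sorted logical CPU IDs."""
    sockets = defaultdict(list)
    for cpu_dir in CPU_ROOT.glob("cpu[0-9]*"):
        topology = cpu_dir / "topology"
        if not topology.exists():  # skip offline CPUs with no topology info
            continue
        package = int((topology / "physical_package_id").read_text())
        sockets[package].append(int(cpu_dir.name[len("cpu"):]))
    return {pkg: sorted(cpus) for pkg, cpus in sockets.items()}

if __name__ == "__main__":
    for socket_id, cpus in sorted(threads_per_socket().items()):
        print(f"socket {socket_id}: pin one Conferencing Node VM to logical CPUs {cpus}")
```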

Affinity does NOT guarantee or reserve resources; it simply forces a VM to use only the socket you define. Mixing Pexip Conferencing Node VMs that are configured with NUMA affinity together with other VMs on the same server is therefore not recommended.

NUMA affinity is not practical in all data center use cases, as it forces a given VM to run on a specific CPU socket, but it is very useful for high-density Pexip deployments with dedicated capacity.

NUMA affinity for Pexip Conferencing Node VMs should only be used if the following conditions apply:

  • If you are using Hyper-V, it must be part of Windows Server Datacenter Edition (the Standard Edition does not have the appropriate configuration options).
  • The server/blade is used for Pexip Conferencing Node VMs only, and the server has only one Pexip Conferencing Node VM per CPU socket (i.e. two VMs per server for a dual-socket CPU such as the E5-2600 generation).
  • vMotion (VMware) or Live Migration (Hyper-V) is NOT used. (Using these may result in having two nodes both locked to a single socket, meaning both will be attempting to access the same processor, with neither using the other processor.)
  • You fully understand what you are doing, and you are happy to revert to the standard settings, if requested by Pexip support, to investigate any potential issues that may result.

Step-by-step guides

For instructions on how to achieve NUMA pinning (also known as NUMA affinity) for your particular hypervisor, see:

Achieving ultra-high density with Sub-NUMA Clustering

In almost all circumstances, we recommend that Sub-NUMA Clustering (SNC) is turned off. Where SNC is enabled and there is a single Pexip Infinity node on that socket, the node is likely to underutilize resources and can fail.

We recommend a maximum of 48 vCPU per transcoding node, or up to 56 vCPU for parts with a high base clock speed. Some 3rd- and 4th-generation Intel Xeon Scalable Series processors have well in excess of 28 physical cores (i.e. more than 56 logical threads with Hyper-Threading), so it is not possible to utilize the whole processor with Hyper-Threading on a single transcoding node.
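The arithmetic behind this is straightforward, as the rough sketch below shows: with Hyper-Threading each physical core provides two vCPU, so anything beyond 28 physical cores exceeds even the 56 vCPU ceiling for a single transcoding node.

```python
# Back-of-the-envelope: vCPU available per socket with Hyper-Threading versus
# the per-node limits above (48 vCPU, or 56 vCPU for high-base-clock parts).
MAX_VCPU_SINGLE_NODE = 56

for physical_cores in (28, 32, 36, 40):
    threads = physical_cores * 2  # one vCPU per logical thread
    unused = max(0, threads - MAX_VCPU_SINGLE_NODE)
    print(f"{physical_cores} cores -> {threads} vCPU available; "
          f"{unused} vCPU beyond what one transcoding node can use")
```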

Optimizing for density

Our standard recommendation for high performance is the Intel Xeon Gold 6342. To gain higher density it is currently necessary to move up to the Xeon Platinum line, which is significantly more expensive. In many cases it is cheaper to deploy more Xeon Gold 6342 machines to gain extra capacity.

Where rack space is at a premium, or a requirement for more than two-socket (2S) scalability dictates the Xeon Platinum line anyway, increasing density with SNC often represents a sensible choice.

Two transcoding nodes per socket

All 3rd- and 4th-generation Intel Xeon Scalable Series processors support SNC. For parts with 32 physical cores or more, we recommend using SNC to treat the processor as two separate NUMA nodes. Under normal operation, a 1U 2-socket server with 2x Intel Xeon Gold 6342 processors would achieve around 195HD capacity. Using a processor with a higher core count and SNC allows the same 1U 2-socket chassis to offer over 300HD of transcoding capacity over four transcoding nodes.

Deployment

As with most hypervisor features, we recommend that this is carried out by people who possess advanced skills with the relevant hypervisor.

Each socket should be split into two equally-sized sub-NUMA nodes, 0 and 1. For node 0, use the entirety of the node for a transcoding node; for node 1, reserve 2 vCPU for the hypervisor and use the rest for another transcoding node.

Example

An Intel Xeon Platinum 8360Y has 36 physical cores. With only a 2.4GHz base clock speed, it is not the ideal choice: a processor with a higher clock speed will give better results.

Use cores 0-17 as sub-NUMA node 0 and cores 18-35 as sub-NUMA node 1. Cores 0-17 should be used as a 36 vCPU Hyper-Threaded transcoding node, and cores 18-34 as a 34 vCPU transcoding node, with core 35 reserved for the hypervisor.

In this case the 2-socket server produces around 280HD of capacity; a faster or larger processor could easily exceed 300HD per rack unit.
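The same split can be expressed as a short calculation. The sketch below is a hypothetical helper (not part of Pexip Infinity) that applies the rule from the Deployment section above, using all of sub-NUMA node 0 and reserving one physical core (2 vCPU) of sub-NUMA node 1 for the hypervisor, and it reproduces the Xeon Platinum 8360Y figures.

```python
def snc_split(physical_cores_per_socket: int, hypervisor_reserve_vcpu: int = 2):
    """Plan two Hyper-Threaded transcoding nodes on one SNC-enabled socket.

    Sub-NUMA node 0 is used entirely for the first transcoding node; sub-NUMA
    node 1 gives up `hypervisor_reserve_vcpu` vCPU (one physical core when
    Hyper-Threading is enabled) and hosts the second transcoding node.
    """
    cores_per_sub_node = physical_cores_per_socket // 2
    node0_vcpu = cores_per_sub_node * 2
    node1_vcpu = cores_per_sub_node * 2 - hypervisor_reserve_vcpu
    return node0_vcpu, node1_vcpu

# Intel Xeon Platinum 8360Y: 36 physical cores per socket.
print(snc_split(36))  # -> (36, 34), matching the example above
```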

Summary of deployment recommendations

We are constantly optimizing our use of the host hardware and expect that some of this advice may change in later releases of our product. However, our current recommendations are:

  • Prefer processors with a high core count.
  • Prefer a smaller number of large Conferencing Nodes rather than a larger number of smaller Conferencing Nodes.
  • Deploy one Conferencing Node per NUMA node (i.e. per socket).
  • Configure one vCPU per physical core on that NUMA node (without hyperthreading and NUMA pinning), or one vCPU per logical thread (with hyperthreading and all VMs pinned to a socket in the hypervisor).
  • Populate memory equally across all NUMA nodes on a single host server.
  • Do not over-commit resources on hardware hosts.