VMware NUMA affinity and hyperthreading

This topic explains how to experiment with VMware NUMA affinity and Hyper-Threading Technology for Pexip Infinity Conferencing Node VMs, in order to achieve up to 50% additional capacity.

If you are taking advantage of hyperthreading to deploy two vCPUs per physical core (i.e. one per logical thread), you must first enable NUMA affinity; if you don't, the Conferencing Node VM will end up spanning multiple NUMA nodes, resulting in a loss of performance.

Affinity does NOT guarantee or reserve resources; it simply forces a VM to run only on the socket you define. Mixing Pexip Conferencing Node VMs that are configured with NUMA affinity with other VMs on the same server is therefore not recommended.

NUMA affinity is not practical in all data center use cases, as it ties a given VM to a specific CPU socket, but it is very useful for high-density Pexip deployments with dedicated capacity.

This information is aimed at administrators with a strong understanding of VMware, who have very good control of their VM environment, and who understand the consequences of conducting these changes.

Please ensure you have read and implemented our recommendations in Achieving high density deployments with NUMA before you continue.

Prerequisites

VMware NUMA affinity for Pexip Conferencing Node VMs should only be used if the following conditions apply:

  • The server/blade is used for Pexip Conferencing Node VMs only, and the server will have only one Pexip Conferencing Node VM per CPU socket (i.e. two VMs per server on a dual-socket CPU such as the E5-2600 generation).
  • vMotion is NOT used. (Using vMotion could result in two nodes both being locked to the same socket, so that both compete for one processor while the other processor sits idle.)
  • You fully understand what you are doing, and you are willing to revert to the standard settings if requested by Pexip support, in order to investigate any issues that may result.

Example server without NUMA affinity - allows for more mobility of VMs

Example server with NUMA affinity - taking advantage of hyperthreading to gain 30-50% more capacity per server

Overview of process

We will configure the two Conferencing Node VMs (in this example, a dual-socket server with E5-2600 series CPUs) with the following advanced VMware parameters:

Conferencing Node A locked to Socket 0

  • cpuid.coresPerSocket = 1
  • numa.vcpu.preferHT = TRUE
  • numa.nodeAffinity = 0

Conferencing Node B locked to Socket 1

  • cpuid.coresPerSocket = 1
  • numa.vcpu.preferHT = TRUE
  • numa.nodeAffinity = 1

You must also check the parameter below to ensure it matches the number of vCPUs assigned to the Conferencing Node:

  • numa.autosize.vcpu.maxPerVirtualNode

For example, if you assigned 24 vCPUs, it should be set to 24.
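For reference, a minimal sketch of how these entries would appear in each VM's configuration (.vmx) file is shown below, assuming the 24 vCPU example used later in this guide. The VM names are from this example, and in practice you would normally add these entries through the vSphere client rather than by editing the files directly:

conf-node_numa0 (Conferencing Node A, locked to socket 0):

  cpuid.coresPerSocket = "1"
  numa.vcpu.preferHT = "TRUE"
  numa.nodeAffinity = "0"
  numa.autosize.vcpu.maxPerVirtualNode = "24"

conf-node_numa1 (Conferencing Node B, locked to socket 1):

  cpuid.coresPerSocket = "1"
  numa.vcpu.preferHT = "TRUE"
  numa.nodeAffinity = "1"
  numa.autosize.vcpu.maxPerVirtualNode = "24"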

Note that if you are experiencing different sampling results from multiple nodes on the same host, you should also ensure that Numa.PreferHT = 1 is set on the host (this applies the preference at the ESXi host level rather than per VM). See https://kb.vmware.com/s/article/2003582 for more information.
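If you prefer to check or set this host-wide option from the ESXi shell rather than through the vSphere client, something along the following lines should work (a sketch only; confirm the option path against the KB article and your ESXi version):

  # View the current host-wide value of Numa.PreferHT
  esxcli system settings advanced list -o /Numa/PreferHT

  # Set Numa.PreferHT = 1 for all VMs on this ESXi host
  esxcli system settings advanced set -o /Numa/PreferHT -i 1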

Setting NUMA affinity

Before you start, please consult your local VMware administrator to understand whether this is appropriate in your environment.

  1. Shut down the Conferencing Node VMs, to allow you to edit their settings.
  2. Give the Conferencing Node VMs names that indicate that they are locked to a given socket (NUMA node). In the example below the VM names are suffixed by numa0 and numa1:

  3. Right-click the first Conferencing Node VM in the inventory and select Edit Settings.
  4. From the VM Options tab, expand the Advanced section and select Edit Configuration:

  5. At the bottom of the window that appears, enter the following Names and corresponding Values for the first VM, which should be locked to the first socket (numa0):
    • cpuid.coresPerSocket = 1
    • numa.vcpu.preferHT = TRUE
    • numa.nodeAffinity = 0

    It should now look like this at the bottom of the parameters list:

  6. Select OK and OK again.

    Now our conf-node_numa0 Virtual Machine is locked to numa0 (the first socket).

  7. Repeat the above steps for the second node, entering the following data for the second VM, which should be locked to the second socket (numa1):

    • cpuid.coresPerSocket = 1
    • numa.vcpu.preferHT = TRUE
    • numa.nodeAffinity = 1

    It should now look like this at the bottom of the parameters list:

  8. Select OK and OK again.

    Now our conf-node_numa1 Virtual Machine is locked to numa1 (the second socket).

It is very important that you set numa.nodeAffinity to 1, not 0, for the second node. If both are set to 0, both VMs will use only NUMA node 0 and will compete for its resources, while NUMA node 1 sits unused.
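If you want to double-check the result, you can inspect the two .vmx files from the ESXi shell; the datastore path below is hypothetical and will differ in your environment:

  # The first node should report affinity 0 and the second node affinity 1
  grep -i "numa.nodeAffinity" /vmfs/volumes/datastore1/conf-node_numa0/conf-node_numa0.vmx
  grep -i "numa.nodeAffinity" /vmfs/volumes/datastore1/conf-node_numa1/conf-node_numa1.vmx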

Increasing vCPUs

You must now increase the number of vCPUs assigned to your Conferencing Nodes, to make use of the hyperthreaded cores. (Hyperthreading must always be enabled, and is generally enabled by default.)

Count logical processors

First you must check how many logical processors each CPU has.

In the example screenshot below, the E5-2680 v3 CPU has 12 physical cores per CPU socket, and there are two CPUs on the server.

With hyperthreading, each physical core has 2 logical processors, so each CPU has 24 logical processors (giving us a total of 48 across both CPUs).

In this case, 2 x 12 = 24 is the "magic number" we are looking for with our Conferencing Nodes - double the number of cores per socket.
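If you prefer the ESXi shell to the vSphere client, the same information can be read from the host; the output below is illustrative for the dual 12-core CPUs in this example, and field names may vary slightly between ESXi versions:

  esxcli hardware cpu global get
  #   CPU Packages: 2
  #   CPU Cores: 24
  #   CPU Threads: 48
  #   Hyperthreading Active: true
  #   Hyperthreading Supported: true
  #   Hyperthreading Enabled: true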

Assign vCPU and RAM

Next, you must edit the settings of the Virtual Machines to assign 24 vCPUs and 24 GB RAM to each of the two Conferencing Nodes.

Ensure that the server actually has at least 24 GB of RAM connected to each CPU socket. Since all four memory channels should be populated with one RAM module each, you will normally require 4 x 8 GB (32 GB) per CPU socket.
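For reference, the resulting vCPU and memory assignment for each node would look something like this in the .vmx file (a sketch for this 24 vCPU example; memSize is expressed in MB):

  numvcpus = "24"
  cpuid.coresPerSocket = "1"
  memSize = "24576"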

Reboot

Finally, save and boot up your virtual machines. After about 5 minutes they should be booted, have performed their performance sampling, and be available for calls.

Viewing updated capacity

To view the updated capacity of the Conferencing Nodes, log in to the Pexip Management Node, select Status > Conferencing Nodes and then select one of the nodes you have just updated. The Maximum capacity - HD connections field should now show slightly less than one HD call per GHz (compared to the previous one HD call per 1.41 GHz).

In our example, 12 physical cores x 2.6 GHz = 31.2 GHz, so the Conferencing Node should show around 30 or 31 HD calls, assuming a balanced BIOS power profile. With maximum performance BIOS power profiles, the results could be up to 33-34 HD calls per Conferencing Node VM.

Our first VM:

Our second VM:

Checking for warnings

You should check for warnings by searching the administrator log (History & Logs > Administrator log) for "sampling".

A successful run of the above example should return something like:

2015-04-05T18:25:40.390+00:00 softlayer-lon02-cnf02 2015-04-05 18:25:40,389 Level="INFO" Name="administrator.system" Message="Performance sampling finished" Detail="HD=31 SD=60 Audio=240"

An unsuccessful run, where VMware has split the Conferencing Node over multiple NUMA nodes, would return the following warning in addition to the result of the performance sampling:

2015-04-06T17:42:17.084+00:00 softlayer-lon02-cnf02 2015-04-06 17:42:17,083 Level="WARNING" Name="administrator.system" Message="Multiple numa nodes detected during sampling" Detail="We strongly recommend that a Pexip Infinity Conferencing Node is deployed on a single NUMA node"

2015-04-06T17:42:17.087+00:00 softlayer-lon02-cnf02 2015-04-06 17:42:17,086 Level="INFO" Name="administrator.system" Message="Performance sampling finished" Detail="HD=21 SD=42 Audio=168"

If you have followed the steps in this guide to set NUMA affinity correctly and you are still getting the warning above, this could be due to another VMware setting. From VMware, select the Conferencing Node and then select Edit settings > Options > General > Configuration parameters. The numa.autosize.vcpu.maxPerVirtualNode option should be set to your "magic number". In our case the "magic number" is 24 - the number of logical processors (vCPUs) assigned to each Conferencing Node VM.

If this option is set to anything lower, e.g. 8, 10 or 12, VMware will create two virtual NUMA nodes, even if the VM is locked to one socket.
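In other words, for this 24 vCPU example the entry should read as follows (shown here as it would appear in the .vmx file):

  numa.autosize.vcpu.maxPerVirtualNode = "24"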

BIOS settings

Ensure all BIOS settings pertaining to power saving are set to maximize performance rather than preserve energy. (Setting these to an energy-preserving or balanced mode may impact transcoding capacity, thus reducing the total number of HD calls that can be provided.) While this setting will use slightly more power, the alternative is to add another server in order to achieve the increase in capacity, and that would in total consume more power than one server running in high performance mode.

The actual settings will depend on the hardware vendor; see BIOS performance settings for some examples.

A quick way to verify that BIOS has been set appropriately is to check the hardware's Power Management settings in VMware (select the host then select Configure > Hardware > Power Management). In most cases, the ACPI C-states should not be exposed to VMware when BIOS is correctly set to maximize performance.

If the ACPI C-states are showing in VMware (as shown below), the BIOS has most likely not been set to maximize performance:

When BIOS has been correctly set to maximize performance, it should in most cases look like this:

If your server is set to maximize performance, but VMware still shows ACPI C-states, change it to balanced (or similar), and then change back to maximize performance. This issue has been observed with some Dell servers that were preconfigured with maximize performance, but the setting did not take effect initially.

VMware and NUMA

As well as the physical restrictions discussed above, the hypervisor can also impose restrictions. VMware provides a virtual NUMA topology to VMs that are configured with more than 8 vCPUs. If you have fewer than 8 physical cores per socket, you should change this default by setting numa.vcpu.min in the VM's configuration file to the number of vCPUs you wish to configure (which will be double the number of physical cores available).
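As a hypothetical illustration, a host with 6 physical cores per socket and a 12 vCPU Conferencing Node VM would need an entry along these lines in the VM's .vmx file (the numbers are for illustration only; substitute your own vCPU count):

  numa.vcpu.min = "12"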

For more information, see https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.resmgmt.doc/GUID-3E956FB5-8ACB-42C3-B068-664989C3FF44.html.