Linux virtualization with libvirt: NUMA management

It turns out that on any more than lightly loaded modern KVM + libvirt host with more than a single physical CPU, NUMA tuning is essential.

If you see symptoms like high system CPU time peaks on the host and steal (or ‘missing’) CPU time on the virtual instances, chances are the system is spending too much time transferring data between NUMA nodes. Another telltale sign is guests/VMs that get killed by the host because of an OOM (Out of Memory) condition, while there still appears to be a healthy margin of free/buffered/cached memory.
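
A quick way to check whether this is happening is to look at the kernel's per-node allocation counters and at the host kernel log (a sketch; the commands are standard, the counter values will of course differ per system):

    # Per-node allocation statistics; steadily growing numa_miss and
    # numa_foreign counters mean memory is being allocated on a remote node.
    numastat

    # OOM kills of guest processes show up in the host kernel log.
    dmesg | grep -i "out of memory"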

The problem here is that host system memory is distributed across the available NUMA nodes, and certain cores are associated with certain nodes. On current x86_64 machines, one physical processor package, typically containing multiple CPU cores, usually corresponds to one NUMA node. It helps performance a lot if a single virtual machine (or any other process) is contained within a single node. The picture below, output of hwloc-ls (apt-get install hwloc), gives an idea of how this is laid out.

[hwloc-ls output: graphical view of the NUMA nodes with their memory, caches and cores]

For that reason you might want to pin guests to a defined subset of logical cores that are all in the same node. The combined memory assigned to the guests pinned to a single NUMA node should also fit within the memory that node has available. Giving a non-NUMA-aware guest, even if it is the only one, more memory than a single NUMA node has available does not make a lot of sense: there is a big performance penalty because part of the assigned RAM always has to be transferred between nodes when it is addressed. If you need to do this anyway, have a look at e.g. http://docs.openstack.org/developer/nova/testing/libvirt-numa.html – you can make the guest itself NUMA aware so it can efficiently use memory from multiple NUMA cells, as sketched below.
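
As a rough sketch of that NUMA-aware-guest approach (the values are made up for a hypothetical 16 GB, 8-vCPU guest and are not taken from the host below), the guest topology is declared inside the <cpu> element of the domain XML:

      <cpu>
        <numa>
          <!-- Split the hypothetical guest into two cells of 4 vCPUs / 8 GiB each -->
          <cell id='0' cpus='0-3' memory='8388608' unit='KiB'/>
          <cell id='1' cpus='4-7' memory='8388608' unit='KiB'/>
        </numa>
      </cpu>

Combined with pinning each group of vCPUs to a different host node, the guest OS can then keep its allocations local to the cell they are used from.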

Helpful tools here are numastat -m (apt-get install numactl), the hwloc suite (apt-get install hwloc, see the picture above; you will need X forwarding (ssh -X) if you want the graphical hwloc-ls output from a remote system), and lscpu.
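
For example, lscpu summarizes which logical CPUs belong to which node; on the host used in the examples below it would report something along these lines (illustrative, derived from the virsh capabilities output further down):

    NUMA node(s):          2
    NUMA node0 CPU(s):     0-5,12-17
    NUMA node1 CPU(s):     6-11,18-23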

Peripherals such as disks and NICs are also NUMA node bound, which can be important for DMA performance.
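
The node a device is attached to can be read from sysfs (a sketch; eth0 and the PCI address are placeholders for your own devices, and a value of -1 means the platform reports no affinity):

    # NUMA node of a network interface
    cat /sys/class/net/eth0/device/numa_node

    # The same information for an arbitrary PCI device
    cat /sys/bus/pci/devices/0000:01:00.0/numa_node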

Example numastat output:

root@host-system:~# numastat -m

Per-node system memory usage (in MBs):
                          Node 0          Node 1           Total
                 --------------- --------------- ---------------
MemTotal                64328.18        64510.16       128838.34
MemFree                 10501.73         2769.72        13271.46
MemUsed                 53826.45        61740.43       115566.88
Active                  48528.07        55331.48       103859.55
Inactive                 3082.79         4013.75         7096.54
Active(anon)            45496.47        51762.40        97258.87
Inactive(anon)              0.12            0.18            0.30
Active(file)             3031.60         3569.08         6600.68
Inactive(file)           3082.67         4013.56         7096.23
Unevictable                 0.00            0.00            0.00
Mlocked                     0.00            0.00            0.00
Dirty                       0.50            0.30            0.80
Writeback                   0.00            0.00            0.00
FilePages                6115.00         7584.15        13699.15
Mapped                     13.54            6.84           20.38
AnonPages               45495.84        51761.65        97257.49
Shmem                       0.73            0.94            1.67
KernelStack                 6.70            4.98           11.69
PageTables                 94.40          124.43          218.82
NFS_Unstable                0.00            0.00            0.00
Bounce                      0.00            0.00            0.00
WritebackTmp                0.00            0.00            0.00
Slab                     1462.05         1622.14         3084.19
SReclaimable              701.69         1071.01         1772.70
SUnreclaim                760.36          551.12         1311.49
AnonHugePages            1396.00          438.00         1834.00
HugePages_Total             0.00            0.00            0.00
HugePages_Free              0.00            0.00            0.00
HugePages_Surp              0.00            0.00            0.00

From virsh capabilities, example NUMA layout:

    <topology>
      <cells num='2'>
        <cell id='0'>
          <memory unit='KiB'>65872056</memory>
          <cpus num='12'>
            <cpu id='0' socket_id='0' core_id='0' siblings='0,12'/>
            <cpu id='1' socket_id='0' core_id='1' siblings='1,13'/>
            <cpu id='2' socket_id='0' core_id='2' siblings='2,14'/>
            <cpu id='3' socket_id='0' core_id='3' siblings='3,15'/>
            <cpu id='4' socket_id='0' core_id='4' siblings='4,16'/>
            <cpu id='5' socket_id='0' core_id='5' siblings='5,17'/>
            <cpu id='12' socket_id='0' core_id='0' siblings='0,12'/>
            <cpu id='13' socket_id='0' core_id='1' siblings='1,13'/>
            <cpu id='14' socket_id='0' core_id='2' siblings='2,14'/>
            <cpu id='15' socket_id='0' core_id='3' siblings='3,15'/>
            <cpu id='16' socket_id='0' core_id='4' siblings='4,16'/>
            <cpu id='17' socket_id='0' core_id='5' siblings='5,17'/>
          </cpus>
        </cell>
        <cell id='1'>
          <memory unit='KiB'>66058400</memory>
          <cpus num='12'>
            <cpu id='6' socket_id='1' core_id='0' siblings='6,18'/>
            <cpu id='7' socket_id='1' core_id='1' siblings='7,19'/>
            <cpu id='8' socket_id='1' core_id='2' siblings='8,20'/>
            <cpu id='9' socket_id='1' core_id='3' siblings='9,21'/>
            <cpu id='10' socket_id='1' core_id='4' siblings='10,22'/>
            <cpu id='11' socket_id='1' core_id='5' siblings='11,23'/>
            <cpu id='18' socket_id='1' core_id='0' siblings='6,18'/>
            <cpu id='19' socket_id='1' core_id='1' siblings='7,19'/>
            <cpu id='20' socket_id='1' core_id='2' siblings='8,20'/>
            <cpu id='21' socket_id='1' core_id='3' siblings='9,21'/>
            <cpu id='22' socket_id='1' core_id='4' siblings='10,22'/>
            <cpu id='23' socket_id='1' core_id='5' siblings='11,23'/>
          </cpus>
        </cell>
      </cells>
    </topology>

Example instance CPU/memory definition (edited with virsh edit) for the above layout:

  <memory unit='KiB'>16777216</memory>
  <currentMemory unit='KiB'>16777216</currentMemory>
  <vcpu placement='static' cpuset='8-9,20-21'>4</vcpu>

The logical CPUs in the cpuset are all within one NUMA node (the cell with id 1), and the 16 GiB of guest memory fits comfortably within the roughly 64 GB available to that node.
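
To make the pinning complete you can also bind the guest's memory allocation to the same node with a numatune element (a sketch, not part of the configuration above):

  <numatune>
    <memory mode='strict' nodeset='1'/>
  </numatune>

After (re)starting the guest, virsh vcpuinfo <instance> shows on which physical CPUs the vCPUs actually run, and numastat -p with the qemu process ID shows on which node its memory actually lives.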
