It turns out that on any more than lightly loaded, modern KVM + libvirt host that houses more than a single physical CPU, NUMA tuning is essential.
If you see symptoms like high system CPU time peaks on the host and steal (or 'missing') CPU time in the virtual instances, chances are the system is spending too much time transferring data between NUMA nodes. Another telltale sign is guests/VMs being killed by the host's OOM (Out of Memory) killer while there still seems to be a healthy margin of free/buffered/cached memory.
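A quick way to check for these symptoms (log locations and exact tool options may differ per distribution):
# steal time shows up in the 'st' column when run inside a guest
vmstat 1 5
# on the host, look for OOM killer activity in the kernel log
dmesg -T | grep -i -E 'out of memory|oom'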
The problem here is that host system memory is distributed across the available NUMA nodes, and certain cores are associated with certain nodes. On current x86_64 machines, one physical processor package (typically containing multiple CPU cores) usually corresponds to one NUMA node. It helps performance a lot if a single virtual machine (or any old process) is contained within a single node. The picture below, output of hwloc-ls (apt-get install hwloc), gives an idea of how this is laid out.
For that reason you might want to pin guests to a defined subset of logical cores that are all in the same node. The memory assigned to the guests pinned to a single NUMA node should, taken together, also fit within the memory that node has available. Having a non-NUMA-aware guest, even if it is the only one, with more memory assigned to it than a single NUMA node has available does not make a lot of sense: there is a big performance penalty because part of the assigned RAM, whenever it is addressed, always has to be transferred between nodes. If you need to do this anyway, have a look at e.g. http://docs.openstack.org/developer/nova/testing/libvirt-numa.html – you can make the guest itself NUMA aware so it can efficiently use memory from multiple NUMA cells.
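As a rough illustration of that last option, the guest's own NUMA topology can be described in its libvirt XML inside the cpu element. The snippet below is only a sketch for a hypothetical 4-vCPU, 16 GiB guest split across two guest NUMA cells (cell memory is in KiB); check the libvirt documentation for the exact details.
<cpu>
  <numa>
    <!-- hypothetical guest: vCPUs 0-1 and 8 GiB on guest cell 0 -->
    <cell id='0' cpus='0-1' memory='8388608'/>
    <!-- vCPUs 2-3 and 8 GiB on guest cell 1 -->
    <cell id='1' cpus='2-3' memory='8388608'/>
  </numa>
</cpu>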
Helpful tools here are numastat -m (apt-get install numactl), the hwloc suite (apt-get install hwloc, see the picture; you will need X forwarding (ssh -X) if you want the graphical hwloc-ls output from a remote system), and lscpu.
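For convenience, the whole toolbox in one go (package names as on Debian/Ubuntu):
apt-get install numactl hwloc
numastat -m           # per-node memory usage, as shown below
lscpu | grep -i numa  # which logical CPUs belong to which node
hwloc-ls              # topology overview, graphical under X (ssh -X)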
Peripherals such as disks and NICs are also NUMA node bound, which can be important for DMA performance.
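You can check which node a device lives on through sysfs; a value of -1 means the kernel has no NUMA affinity information for it (the device names below are just examples):
cat /sys/class/net/eth0/device/numa_node         # NIC
cat /sys/bus/pci/devices/0000:01:00.0/numa_node  # arbitrary PCI device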
Example numastat output (note that node 1 is much closer to running out of memory than node 0, even though the host as a whole still has over 13 GB free):
root@host-system:~# numastat -m
Per-node system memory usage (in MBs):
Node 0 Node 1 Total
--------------- --------------- ---------------
MemTotal 64328.18 64510.16 128838.34
MemFree 10501.73 2769.72 13271.46
MemUsed 53826.45 61740.43 115566.88
Active 48528.07 55331.48 103859.55
Inactive 3082.79 4013.75 7096.54
Active(anon) 45496.47 51762.40 97258.87
Inactive(anon) 0.12 0.18 0.30
Active(file) 3031.60 3569.08 6600.68
Inactive(file) 3082.67 4013.56 7096.23
Unevictable 0.00 0.00 0.00
Mlocked 0.00 0.00 0.00
Dirty 0.50 0.30 0.80
Writeback 0.00 0.00 0.00
FilePages 6115.00 7584.15 13699.15
Mapped 13.54 6.84 20.38
AnonPages 45495.84 51761.65 97257.49
Shmem 0.73 0.94 1.67
KernelStack 6.70 4.98 11.69
PageTables 94.40 124.43 218.82
NFS_Unstable 0.00 0.00 0.00
Bounce 0.00 0.00 0.00
WritebackTmp 0.00 0.00 0.00
Slab 1462.05 1622.14 3084.19
SReclaimable 701.69 1071.01 1772.70
SUnreclaim 760.36 551.12 1311.49
AnonHugePages 1396.00 438.00 1834.00
HugePages_Total 0.00 0.00 0.00
HugePages_Free 0.00 0.00 0.00
HugePages_Surp 0.00 0.00 0.00
From virsh capabilities, an example NUMA layout:
<topology>
<cells num='2'>
<cell id='0'>
<memory unit='KiB'>65872056</memory>
<cpus num='12'>
<cpu id='0' socket_id='0' core_id='0' siblings='0,12'/>
<cpu id='1' socket_id='0' core_id='1' siblings='1,13'/>
<cpu id='2' socket_id='0' core_id='2' siblings='2,14'/>
<cpu id='3' socket_id='0' core_id='3' siblings='3,15'/>
<cpu id='4' socket_id='0' core_id='4' siblings='4,16'/>
<cpu id='5' socket_id='0' core_id='5' siblings='5,17'/>
<cpu id='12' socket_id='0' core_id='0' siblings='0,12'/>
<cpu id='13' socket_id='0' core_id='1' siblings='1,13'/>
<cpu id='14' socket_id='0' core_id='2' siblings='2,14'/>
<cpu id='15' socket_id='0' core_id='3' siblings='3,15'/>
<cpu id='16' socket_id='0' core_id='4' siblings='4,16'/>
<cpu id='17' socket_id='0' core_id='5' siblings='5,17'/>
</cpus>
</cell>
<cell id='1'>
<memory unit='KiB'>66058400</memory>
<cpus num='12'>
<cpu id='6' socket_id='1' core_id='0' siblings='6,18'/>
<cpu id='7' socket_id='1' core_id='1' siblings='7,19'/>
<cpu id='8' socket_id='1' core_id='2' siblings='8,20'/>
<cpu id='9' socket_id='1' core_id='3' siblings='9,21'/>
<cpu id='10' socket_id='1' core_id='4' siblings='10,22'/>
<cpu id='11' socket_id='1' core_id='5' siblings='11,23'/>
<cpu id='18' socket_id='1' core_id='0' siblings='6,18'/>
<cpu id='19' socket_id='1' core_id='1' siblings='7,19'/>
<cpu id='20' socket_id='1' core_id='2' siblings='8,20'/>
<cpu id='21' socket_id='1' core_id='3' siblings='9,21'/>
<cpu id='22' socket_id='1' core_id='4' siblings='10,22'/>
<cpu id='23' socket_id='1' core_id='5' siblings='11,23'/>
</cpus>
</cell>
</cells>
</topology>
Example instance CPU/memory definition (virsh edit instance) for the above layout:
<memory unit='KiB'>16777216</memory>
<currentMemory unit='KiB'>16777216</currentMemory>
<vcpu placement='static' cpuset='8-9,20-21'>4</vcpu>
The logical CPUs in the cpuset (8 and 9, plus their hyperthread siblings 20 and 21) are all within one NUMA node (the cell with id 1), and the 16 GiB of guest memory fits comfortably within the roughly 64 GB available to that node.
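Note that the cpuset above only pins the vCPUs; to also keep the guest's memory on the same node, a numatune element can be added to the instance definition, for example (strict mode makes allocations fail rather than spill over to another node):
<numatune>
  <memory mode='strict' nodeset='1'/>
</numatune>
For experimenting on a running guest, pinning can also be changed on the fly with virsh vcpupin instance 0 8 (and so on for each vCPU) and verified with virsh vcpuinfo instance.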