I was checking the VMware Fusion support documentation for the security fixes, and VMware published a table of the performance hit for ESXi in the support article:
VMware Performance Impact Statement for ‘L1 Terminal Fault - VMM’ (L1TF - VMM) mitigations: CVE-2018-3646 (55767)
"this scheduler provides the Hyper-Threading-aware mitigation by scheduling on only one Hyper-Thread of a Hyper-Thread-enabled core."
It's a 22 to 32% performance hit to mitigate side-channel attacks:
[Attachment: VMware's table of performance impact by workload type]
That is very interesting, but it's still quite context specific. The article describes, at a high level, the scheduling strategy taken from the hypervisor's point of view. The 32% figure was the loss on a host running an OLTP database. Hyperthreading there is likely being used to hide latency incurred by fetches from disk or main memory, since each core can run up to 2 threads without needing to explicitly swap them.
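As a toy illustration of that kind of latency hiding (my own sketch, not anything from the article): a dependent pointer chase stalls the core on nearly every fetch, which is exactly the dead time a sibling hyperthread can fill.

```c
/* Sketch: a latency-bound pointer chase. Two copies of this loop pinned
 * to the sibling hyperthreads of one core can nearly double aggregate
 * throughput, because each thread spends most of its time stalled on
 * memory. Hypothetical example, not VMware's benchmark. */
#include <stddef.h>
#include <stdint.h>

/* pad each node to one cache line so every step is a fresh fetch */
typedef struct node { struct node *next; char pad[56]; } node;

uintptr_t chase(node *p, size_t n) {
    while (n--)
        p = p->next;      /* serialized: each load waits on the previous one */
    return (uintptr_t)p;  /* returning p keeps the loop from being optimized out */
}
```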
"Mitigation of the Concurrent-context attack vector requires enabling the ESXi Side-Channel-Aware Scheduler, also referred to as the ESXi SCA Scheduler. Currently, this scheduler provides the Hyper-Threading-aware mitigation by scheduling on only one Hyper-Thread of a Hyper-Thread-enabled core."
I'm actually curious what the Mixed Workload, Java, and VDI rows entail. The latter two are still server-side workloads.
They also mention the following.
"While each application has its own characteristics, we saw a similar pattern across different types of applications. The more usable host CPU capacity available prior to mitigation, the less the performance impact was after mitigation."
They also mention that the 32% number comes from a host running at 90% of CPU capacity. That is quite high for something that isn't just running long arithmetic calculations. Setups that run largely compute-bound, arithmetic-heavy tasks may not benefit much from hyperthreading, to the point where they may have disabled it anyway. You can see export specifications for a number of CPUs here. If we go by their numbers and consider a strongly compute-bound problem like GEMM, along with Intel's stated numbers for load throughput, then even on Skylake and later you can in fact saturate or mostly saturate peak bandwidth without scheduling 1 physical core as 2 logical ones.
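To make the GEMM case concrete, here's the shape of a register-blocked micro-kernel (my own sketch; the 4x16 tile and the names are arbitrary choices). Each k step issues 6 loads against 8 FMAs, and at 2 FMAs per cycle those 8 FMAs take 4 cycles, which is plenty of time to retire the loads on any recent part, so one thread keeps the multipliers busy on its own and leaves little slack for a sibling hyperthread:

```c
/* Sketch of a register-blocked GEMM micro-kernel (AVX2 + FMA, floats),
 * computing a 4x16 tile of C. Per k step: 2 vector loads of B, 4
 * broadcast loads of A, 8 FMAs. Hypothetical layout, not tuned code. */
#include <immintrin.h>

void kernel_4x16(const float *A, const float *B, float *C,
                 int k, int lda, int ldb, int ldc) {
    __m256 c00 = _mm256_setzero_ps(), c01 = _mm256_setzero_ps();
    __m256 c10 = _mm256_setzero_ps(), c11 = _mm256_setzero_ps();
    __m256 c20 = _mm256_setzero_ps(), c21 = _mm256_setzero_ps();
    __m256 c30 = _mm256_setzero_ps(), c31 = _mm256_setzero_ps();
    for (int p = 0; p < k; ++p) {
        __m256 b0 = _mm256_loadu_ps(B + p * ldb);         /* 2 vector loads */
        __m256 b1 = _mm256_loadu_ps(B + p * ldb + 8);
        __m256 a0 = _mm256_broadcast_ss(A + 0 * lda + p); /* 4 broadcast loads */
        __m256 a1 = _mm256_broadcast_ss(A + 1 * lda + p);
        __m256 a2 = _mm256_broadcast_ss(A + 2 * lda + p);
        __m256 a3 = _mm256_broadcast_ss(A + 3 * lda + p);
        /* 8 FMAs: the accumulators live in registers the whole time */
        c00 = _mm256_fmadd_ps(a0, b0, c00); c01 = _mm256_fmadd_ps(a0, b1, c01);
        c10 = _mm256_fmadd_ps(a1, b0, c10); c11 = _mm256_fmadd_ps(a1, b1, c11);
        c20 = _mm256_fmadd_ps(a2, b0, c20); c21 = _mm256_fmadd_ps(a2, b1, c21);
        c30 = _mm256_fmadd_ps(a3, b0, c30); c31 = _mm256_fmadd_ps(a3, b1, c31);
    }
    /* write the tile back (overwriting C for simplicity) */
    _mm256_storeu_ps(C + 0*ldc, c00); _mm256_storeu_ps(C + 0*ldc + 8, c01);
    _mm256_storeu_ps(C + 1*ldc, c10); _mm256_storeu_ps(C + 1*ldc + 8, c11);
    _mm256_storeu_ps(C + 2*ldc, c20); _mm256_storeu_ps(C + 2*ldc + 8, c21);
    _mm256_storeu_ps(C + 3*ldc, c30); _mm256_storeu_ps(C + 3*ldc + 8, c31);
}
```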
Given an L1 hit time of about 4 cycles and a claimed load bandwidth of 4 x 256-bit loads per cycle, you might end up with a little spare L1 bandwidth. I suspect, though, that this makes it easier to mostly saturate the CPU on classically memory-bound problems, an example being elementwise multiplication.
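A minimal sketch of the kind of loop I mean (assuming AVX2, 32-byte-aligned buffers, and a length that's a multiple of 8):

```c
/* Elementwise multiply: per output vector we do 2 loads, 1 multiply,
 * 1 store. The load -> multiply -> store chain is what the rough cycle
 * count below refers to, if iterations are not allowed to overlap. */
#include <immintrin.h>
#include <stddef.h>

void vmul(const float *a, const float *b, float *c, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256 va = _mm256_load_ps(a + i);  /* ~4-cycle L1 hit */
        __m256 vb = _mm256_load_ps(b + i);  /* issues alongside va */
        __m256 vc = _mm256_mul_ps(va, vb);  /* 4-cycle latency on Skylake */
        _mm256_store_ps(c + i, vc);         /* aligned store */
    }
}
```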
The inner loop here should take about 12 cycles (edit: on Skylake), judging by Intel's numbers, if everything hits in L1. I haven't timed it. Using their peak bandwidth numbers on loads, we can't fully hide load latency here and may eat a couple of delay cycles at the start. This assumes loads don't overlap across consecutive loop iterations; the stores face the same question, and with aligned memory we don't have conflicts or the minor latency penalties that come from store forwarding.
This is still a historically memory-bound problem compared to something like matrix multiplication, but I don't see much benefit to running 2 threads per core here, in contrast to the test environment VMware used. The two situations are quite different, but this kind of code is still common enough to serve as a minimal example.