Definitely good info for everyone out there... it's totally true that each application should be tested with Hyperthreading on and off, and that HT never really gives great benefits.
Of course I'm aware of this... and it's one of the things I like least about my Mac Pro. I really wish it had more real cores, which is why I'm excited about getting 4 more (2x8 vs. 2x6) with these new processors. It's also one of the reasons I went with the i5 for my home iMac instead of the i7.
Hyperthreads don't help with FP (the floating point units aren't duplicated between the logical cores) and, as you say, can actually be detrimental...
Another big issue with SPECfp is memory bandwidth - even when execution units sit idle, the bottleneck is the bandwidth available to move data into and out of the CPU.
So the HT problem is not only contention for execution units (because every thread needs the same CPU resources), but also memory bandwidth (since all logical cores share the same memory channels).
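If anyone wants to see this on their own box, here's a minimal sketch of the kind of test I mean (my own code, not from anything above; the thread counts and loop constants are arbitrary): a pure-FP kernel run on 1, 2, 4, and 8 threads.

```c
/* Build: cc -O2 -o htbench htbench.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>

#define ITERS 50000000UL

static void *fp_kernel(void *arg)
{
    (void)arg;
    volatile double x = 1.0;            /* volatile so -O2 keeps the loop */
    for (unsigned long i = 0; i < ITERS; i++)
        x = x * 1.0000001 + 0.0000001;  /* FP multiply-add, no memory traffic */
    return NULL;
}

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    /* Each thread does a fixed amount of FP work, so with perfect scaling
     * the wall time stays flat as threads are added. If it climbs once the
     * thread count passes the physical core count (but is still within the
     * logical-CPU count), you're watching hyperthreads fight over the FP
     * units. */
    for (int n = 1; n <= 8; n *= 2) {
        pthread_t tid[8];
        double t0 = now();
        for (int i = 0; i < n; i++)
            pthread_create(&tid[i], NULL, fp_kernel, NULL);
        for (int i = 0; i < n; i++)
            pthread_join(tid[i], NULL);
        printf("%d thread(s): %.2f s\n", n, now() - t0);
    }
    return 0;
}
```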
BTW - since Snow Leopard, OS X has gotten really good at managing processes in a hyperthreaded environment. It won't do anything stupid like assigning two processes to the same physical core when other physical cores are available.
Nice to hear that Apple's OS X is becoming more NUMA/HT-aware.
However, note that the general case is a very hard problem.
Imagine a simple case of two physical HT cores. (In the case of a single HT core, HT will always be a win.) Physical core 0 is LCPU 0 and LCPU 1, physical core 1 is LCPU 2 and LCPU 3:
- µsec 0 - 2 threads active, on LCPU 0 and LCPU 2 (scheduler has assigned active threads to idle physical cores)
- µsec 1 - 4 threads active, on LCPU 0,1,2,3 (scheduler assigns the 4 threads to the 4 logical cores)
- µsec 2 - threads on LCPU 2,3 complete, leaving two threads active on LCPU 0 & LCPU 1 (oops, the 2nd physical core is idle and two threads are sharing one physical core)
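As an aside, you can see the physical/logical split the scheduler is juggling on your own machine - hw.physicalcpu and hw.logicalcpu are the real sysctl names on OS X; the little program around them is just my sketch:

```c
/* Build: cc -o cores cores.c */
#include <stdio.h>
#include <sys/types.h>
#include <sys/sysctl.h>

int main(void)
{
    int physical = 0, logical = 0;
    size_t len = sizeof(int);

    sysctlbyname("hw.physicalcpu", &physical, &len, NULL, 0);
    len = sizeof(int);
    sysctlbyname("hw.logicalcpu", &logical, &len, NULL, 0);

    /* On the two-HT-core example above, this prints 2 physical / 4 logical. */
    printf("%d physical cores, %d logical CPUs (%d-way HT)\n",
           physical, logical, logical / physical);
    return 0;
}
```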
There's a real cost to moving a thread between cores (the register state has to move, and the L1/L2 caches on the new core start cold), so you don't want to rebalance every nanosecond.
You should assume that the OS designers have done performance studies to determine an appropriate period to allow unbalanced operation before taking the cache-refill hit of moving a thread between logical CPUs. Sometimes, in fact, it can be better to suspend a thread and resume it some nanoseconds later rather than immediately move it to an idle logical CPU.
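For what it's worth, OS X won't let you pin a thread to a particular logical CPU, but since Leopard it has exposed affinity tags (THREAD_AFFINITY_POLICY in mach/thread_policy.h) as a way to tell the scheduler which threads shouldn't share caches. A minimal sketch (the helper name is mine):

```c
#include <mach/mach.h>
#include <mach/thread_policy.h>
#include <pthread.h>

/* Give the calling thread an affinity tag. Threads with different tags are
 * hints to the scheduler to place them on separate caches, i.e. separate
 * physical cores; threads sharing a tag may be co-located. Advice only. */
static void set_affinity_tag(int tag)
{
    thread_affinity_policy_data_t policy = { tag };
    thread_policy_set(pthread_mach_thread_np(pthread_self()),
                      THREAD_AFFINITY_POLICY,
                      (thread_policy_t)&policy,
                      THREAD_AFFINITY_POLICY_COUNT);
}
```

It's a hint, not a guarantee - the scheduler is still free to rebalance exactly as described above.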
Multiprocessor scheduling is a very, very hard problem to optimize for all workloads. Even single-socket multi-core systems have nuances to handle:
- L1 and/or L2 caches may be private per core with L2/L3 caches shared - or they may be two single-core CPUs in the same socket
- a CPU may be a complete core, or it may be a logical CPU sharing some cache levels and execution units with other logical CPUs
With multi-socket systems, the complications explode. You have all of the cache issues of a single-socket system, plus the fact that moving an active process from a logical CPU on one socket to a logical CPU on another requires flushing and invalidating all of its cache data on the first socket and starting from scratch on the second.
And with current x64 multi-socket CPUs you have the added issue of NUMA. Since each socket has its own memory controller, a process running on a logical CPU in socket A will, on many systems, initially have its memory allocated from the RAM attached to socket A.
If the scheduler then moves the thread to socket B (or if the process has threads running on both sockets A and B), the threads on socket B will have slower access to that memory.
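OS X keeps all of this hidden from you, so purely as an illustration of the effect: on Linux you can see the local-vs-remote penalty directly with libnuma. A hedged sketch, assuming a two-socket box with NUMA nodes 0 and 1 (the buffer size is arbitrary):

```c
/* Build: cc -O2 -o numa_demo numa_demo.c -lnuma */
#include <numa.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define SZ (256UL * 1024 * 1024)   /* 256 MB, big enough to defeat caches */

static double sweep(volatile char *buf)
{
    struct timespec t0, t1;
    unsigned long sum = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < SZ; i += 64)   /* one touch per cache line */
        sum += buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    if (numa_available() < 0 || numa_max_node() < 1) {
        fprintf(stderr, "need a NUMA system with 2+ nodes\n");
        return 1;
    }

    /* Allocate and touch the buffer on node 0 - the "socket A" above. */
    char *buf = numa_alloc_onnode(SZ, 0);
    memset(buf, 1, SZ);

    numa_run_on_node(0);                       /* local access */
    printf("local  (node 0): %.3f s\n", sweep(buf));

    numa_run_on_node(1);                       /* remote access, over the interconnect */
    printf("remote (node 1): %.3f s\n", sweep(buf));

    numa_free(buf, SZ);
    return 0;
}
```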
Lion is even better in this regard. In our testing there is no longer any penalty for leaving hyperthreading on - you only pay a price if you actually load the hyperthreads with more FP processes than you have real cores.
The lesson is: it depends. Hyperthreads can help, but they certainly aren't as good as real cores!
Bing! You need to look at the load that's most critical for you, test it, and decide. Sometimes HT will help, sometimes HT will hurt.