nMP Turbo Boost

VirtualRain · Jan 10, 2014

I spent some time today trying to see how Turbo Boost works on the nMP.

Here's what I was expecting in terms of Turbo Boost clock increases (from Marco.org)...
- E5-1650 v2 3.5GHz base; Turbo Modes: (1/1/2/2/2/4)
- This means: 3.9GHz for single core, 3.7GHz for 2-4 cores, and 3.6GHz for 5-6 cores (full load).
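For anyone who wants to check the decoding of the turbo notation, here is a minimal sketch. It assumes the bins are listed from all cores active down to one core active and that each bin is worth 100MHz, which is how the numbers above work out. The function name `turbo_table` is made up for illustration.

```python
def turbo_table(base_mhz, bins, step_mhz=100):
    """Map active-core count -> maximum turbo frequency in MHz.

    Assumption: `bins` is ordered from "all cores active" down to
    "one core active", and each bin adds `step_mhz` to the base clock.
    """
    total = len(bins)
    return {total - i: base_mhz + b * step_mhz for i, b in enumerate(bins)}


# E5-1650 v2: 3.5GHz base, Turbo Modes (1/1/2/2/2/4)
table = turbo_table(3500, [1, 1, 2, 2, 2, 4])

for cores in sorted(table):
    print(f"{cores} active core(s): {table[cores] / 1000:.1f} GHz")
```

This gives 3.9GHz with one core active, 3.7GHz with 2-4 cores active and 3.6GHz with 5-6 cores active.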

Tools: I downloaded the Intel Power Gadget for OS X which reports actual CPU frequency.

It's worth noting that the IA Frequency reported in the GUI of Intel Power Gadget is not necessarily correct. I'm guessing it's some kind of average over the update interval. The correct CPU speed, which is always a multiple of 100MHz, is reported in the log file.

I also came across theSeb's terminal command for monitoring thermal limiting. (There was none during any of these tests).

Idle: At idle (surfing the web for example) the frequency of my 6-core averages 3.5GHz. Looking at the log file, it fluctuates between 3.5 and 3.7GHz but will only see spikes higher than 3.5GHz for very short intervals (about 0.1 second). The first attachment shows the CPU frequency and processor power for a few seconds of idle.

Single Core: I used CinebenchR15 which allows you to set a custom number of threads... with a single thread, you can see the results in the first screen shot. The frequency would fluctuate erratically between 3.60 and 3.70GHz. I'm not sure why it fluctuates and I'm not sure why I'm not seeing higher turbo boost under this scenario.

Are all the other OS X threads related to the OS and background tasks killing potential Turbo modes? See the second attachment.

2-6 Cores: When I run a lightly threaded benchmark like CinebenchR15 CPU (with 2 or 4 threads) the outcome is the same as running a fully threaded benchmark that utilizes all cores (like Handbrake). The CPU frequency is pegged at 3.60GHz the entire time (see last attachment).

Aperture: When I'm flipping through images in Aperture, the CPU fluctuates between 3.5 and 3.7GHz just like it does during a single core benchmark.

Conclusions/Questions: I can't get any sustained frequency above 3.6GHz. 3.7GHz appears very occasionally for short intervals - not sustained like it should under 2-4 core load.

I'm open to suggestions on how to better test this and perhaps trigger sustained 3.9 or 3.7GHz operation.

If anyone else knows of any tools that work under OS X to monitor turbo states, that would be great. And if anyone has Windows running on their nMP, it would be interesting to see what you can determine (as there are a lot of CPU monitoring tools available under Windows).

deconstruct60 · Jan 10, 2014

VirtualRain said:
Idle: At idle (surfing the web for example) the frequency of my 6-core averages 3.5GHz. Looking at the log file, it fluctuates between 3.5 and 3.7GHz but will only see spikes higher than 3.5GHz for very short intervals (about 0.1 second). The first attachment shows the CPU frequency and processor power for a few seconds of idle.

The base and Turbo modes are not fixed or "sticky". The CPU can dynamically adjust the clock several times in a second.

Single Core: I used CinebenchR15 which allows you to set a custom number of threads... with a single thread, you can see the results in the first screen shot.

Setting Cinebench to a single thread has absolutely zero impact on the OS, its threads, and the dozen or more other processes running.

The frequency would fluctuate erratically between 3.60 and 3.70GHz.

Not particularly erratic, given there are multiple processes getting time slices every second. Cinebench can drive a dominant reading by soaking up more cycles than everything else, but in no way is this configuration setting driving all the processes onto just one core. OS X might even move the Cinebench thread around to different cores over time if it isn't tagged with some flavor of fixed processor affinity setting.

Are all the other OS X threads related to the OS and background tasks killing potential Turbo modes?

They aren't killing it. The average CPU clock rate is higher. But if there is work for 2-3 cores the choice is either take time slices away from this app thread to let them run or spread the workload out over multiple cores. Once the cores are lit up even briefly that will have short term impact on power.

OS X's process scheduler probably doesn't take much of Turbo and the multicore interactions into account for smaller sharing workloads that might possibly be weaved into 1-2 cores.

phoenixsan · Jan 10, 2014

Let me....

congratulate you. Very good work and precisely documented. Sounds like you are an engineer.....

The Windows testing can be worthwhile. One of the more impressive features of modern Apple hardware is the ability to run other OSes out of the box. On quality hardware, no less.....

VirtualRain · Jan 10, 2014

deconstruct60 said:
OS X's process scheduler probably doesn't take much of Turbo and the multicore interactions into account for smaller sharing workloads that might possibly be weaved into 1-2 cores.

So are you saying that what I observed is expected and there's no way under OS X to get continual operation at 3.7 or 3.9GHz?

AidenShaw · Jan 10, 2014

phoenixsan said:
The Windows testing can be worthwhile. One of the more impressive features of modern Apple hardware is the ability to run other OSes out of the box. On quality hardware, no less.....

The rest of us have been running Windows and Linux and Solaris and more esoteric OSes on quality hardware for ages. We're really not impressed that a system can run more than one operating system.

Hardware-wise, the "new Mini Pro" is an "amateur hour" project compared to the capabilities of Xeon E5-xxxxx v2 systems from the enterprise players. (64GiB RAM support vs 512GiB - LOL) It's pretty, small, quiet, and extremely limited. (And once you start attaching all the external boxes needed to make it useful - the "quiet factor" is lost.)

As I mentioned in an earlier post, in the BIOS on one of my 16 physical core (dual 8 physical core) HP ProLiants I limited the system to "2 physical cores per socket, hyperthreading disabled" - turning a 16-core system into a dual socket quad.

It's latched to full turbo mode - it always shows "core speed 126%" regardless of the load. I think that the OP is showing some bug in the Apple OSX kernel. Or some configuration parameter is biased towards "save power" instead of "max performance".

deconstruct60 · Jan 10, 2014

VirtualRain said:
So are you saying that what I observed is expected and there's no way under OS X to get continual operation at 3.7 or 3.9GHz?

The gimmick of using boot firmware to "turn off" cores is typically the only way to get "constant operation". Turbo was never designed for constant operation at all. What "base" and "turbo" are designed for is the dynamic range between the two points. They are boundaries on the range in which the CPU operates, but it shouldn't particularly be on one extreme side or the other in normal workloads.
[ Limiting an application's primary workload to just one thread very much makes it a "plain jane" normal workload. That is the same instruction presentation you would see in apps from 20 years ago. ]

Operating systems try to pick which process to run where very quickly so they can let apps get back to what they were doing. That can mean they don't have time to consider every single possible option. If process 222 was on CPU core 4 before, there are often good reasons to send it right back there.... even if core 4 isn't doing anything else in between this and the last time process 222 woke up to do something briefly and go back to sleep.

Similar to reasons OS X punts on taking NUMA factors into account.

bearcatrp · Jan 10, 2014

Download Boinc (grid computing). You can specify how many threads to crunch at a time. Suggest you load up work units, then start at one work unit crunching, then two, etc.. Watch IPG and see what happens. I crunch all the time but also use it to test my new builds.

Tutor · Jan 10, 2014

VirtualRain said:
... .
I'm open to suggestions on how to ... perhaps trigger sustained 3.9 or 3.7GHz operation.
... .

Try running simultaneously (or at least start them one right after the other as quickly as possible) (1) CPUTest, (2) Cinebench 11.5, (3) Cinebench 15, (4) Geekbench2, (5) Geekbench3 and (6) mCoreTest64. Turbo Boost isn't ever likely to be "sustained" for the same core(s): as a core rises in temperature and power consumed to take on the load, carrying that load puts it in the position of having to step down and let a relatively cooler, less heavily powered, unused core take its place. Thus, though the turbo speed might appear to be a little sustained, in reality it's more like round robin whack-a-mole. That's what I learned in 2010-11 when I began underclocking and turbo biasing my 12-core systems to have turbo ratios of DDDDEE (13, 13, 13, 13, 14, 14) for each 5680. It was like watching, through Windows of course, popping corn on steroids.

VirtualRain · Jan 10, 2014

deconstruct60 said:
The gimmick of using boot firmware to "turn off" cores is typically the only way to get "constant operation". Turbo was never designed for constant operation at all. What "base" and "turbo" are designed for is the dynamic range between the two points. They are boundaries on the range in which the CPU operates, but it shouldn't particularly be on one extreme side or the other in normal workloads.
[ Limiting an application's primary workload to just one thread very much makes it a "plain jane" normal workload. That is the same instruction presentation you would see in apps from 20 years ago. ]

Operating systems try to pick which process to run where very quickly so they can let apps get back to what they were doing. That can mean they don't have time to consider every single possible option. If process 222 was on CPU core 4 before, there are often good reasons to send it right back there.... even if core 4 isn't doing anything else in between this and the last time process 222 woke up to do something briefly and go back to sleep.

Similar to reasons OS X punts on taking NUMA factors into account.

I get what you're saying... I mean looking at Activity Monitor at idle, I have over 700 threads active from 162 processes, but CPU utilization is less than 1%.

However, Turbo Boost is either over-rated and the marketing is misleading, or it's not working properly, or I'm unable to test it properly.

Attached is a graph of CPU frequency over the duration of a Geekbench run. It spends 24% of the time at 3.7GHz and 76% of the time at 3.6GHz.

Yet TDP is an average of only 53W (peak of 85W)... and CPU temperature is around 60-deg C... so there's plenty of thermal headroom to work with.

It would seem it could have run the whole time at 3.7GHz without any negative impact.
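To put a number on that split, here is a minimal sketch that turns a list of frequency samples into the share of time at each clock and an average clock. It assumes the log has already been reduced to one frequency value in MHz per sample; the sample list below is synthetic, built only to reproduce the 24%/76% split, and is not real log data.

```python
from collections import Counter


def residency(samples_mhz):
    """Return {frequency_mhz: fraction_of_samples} for equally spaced samples."""
    counts = Counter(samples_mhz)
    n = len(samples_mhz)
    return {freq: count / n for freq, count in sorted(counts.items())}


def mean_mhz(samples_mhz):
    """Average frequency, valid when every sample covers the same interval."""
    return sum(samples_mhz) / len(samples_mhz)


# Synthetic stand-in for the Geekbench run: 24% at 3.7GHz, 76% at 3.6GHz.
samples = [3700] * 24 + [3600] * 76

shares = residency(samples)
avg = mean_mhz(samples)
gain_over_base = (avg - 3500) / 3500

print(shares)
print(f"average {avg:.0f} MHz, {gain_over_base:.1%} above the 3.5GHz base")
```

With that split the average works out to 3624MHz, about 3.5% over base, versus 5.7% if it had held 3.7GHz the whole run.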

bearcatrp said:
Download Boinc (grid computing). You can specify how many threads to crunch at a time. Suggest you load up work units, then start at one work unit crunching, then two, etc.. Watch IPG and see what happens. I crunch all the time but also use it to test my new builds.

Had a look, that's just too much effort and too invasive.

VirtualRain · Jan 10, 2014

Tutor said:
Try running simultaneously (or at least start them one right after the other as quickly as possible) (1) CPUTest, (2) Cinebench 11.5, (3) Cinebench 15, (4) Geekbench2, (5) Geekbench3 and (6) mCoreTest64. Turbo Boost isn't ever likely to be "sustained" for the same core(s): as a core rises in temperature and power consumed to take on the load, carrying that load puts it in the position of having to step down and let a relatively cooler, less heavily powered, unused core take its place. Thus, though the turbo speed might appear to be a little sustained, in reality it's more like round robin whack-a-mole. That's what I learned in 2010-11 when I began underclocking and turbo biasing my 12-core systems to have turbo ratios of DDDDEE (13, 13, 13, 13, 14, 14) for each 5680. It was like watching, through Windows of course, popping corn on steroids.

LOL... thanks... I guess it is what it is then.

EDIT: I tried a bunch of benchmarks at once... of course, the CPU was pegged at 3.6GHz the whole time, which is what you'd expect for this fully loaded 6-core.

VirtualRain · Jan 10, 2014

Unless someone has some other great ideas, I guess the moral of this short story is that early news about the 8, 6 and 4 core all sharing the same Turbo Boost speeds for lightly threaded workloads is BS. (See attached table below from Marco.org).

You simply can't say that the 8-core or 6-core will have similar clocks to the 4-core on lightly threaded tasks. It's highly likely that the only CPU that will see 3.8 or 3.9GHz is the Quad. And it's entirely possible the 8-Core will never go higher than 3.5GHz.

Those of you getting 4 and 8 Core CPUs... It will be interesting to see what your Turbo Clocks look like under different conditions. Have a look and please report back!

[G5]Hydra · Jan 10, 2014

VirtualRain said:
Unless someone has some other great ideas, I guess the moral of this short story is that early news about the 8, 6 and 4 core all sharing the same Turbo Boost speeds for lightly threaded workloads is BS. (See attached table below from Marco.org).

You simply can't say that the 8-core or 6-core will have similar clocks to the 4-core on lightly threaded tasks. It's highly likely that the only CPU that will see 3.8 or 3.9GHz is the Quad. And it's entirely possible the 8-Core will never go higher than 3.5GHz.

Those of you getting 4 and 8 Core CPUs... It will be interesting to see what your Turbo Clocks look like under different conditions. Have a look and please report back!

Interesting work on looking at Turbo. I do wonder how accurate all the reporting is, though, as it's such a low-level hardware operation you are trying to look at. I don't know about it being complete BS, though; you have to look at the published numbers as the best possible case for ever getting a sustained max turbo. There are lots of factors that can affect things, like each chip being slightly different, the ambient air temps, and the fact that the machine is brand new and Apple might need to do some more optimization. Part of the problem is that a modern desktop OS has so much going on all the time that it muddies the water. If single threaded performance is of utmost importance, though, you wasted money on the hex. That being said, you posted above:

VirtualRain said:
Attached is a graph of CPU frequency over the duration of a Geekbench run. It spends 24% of the time at 3.7GHz and 76% of the time at 3.6GHz.

That means it spends 100% of the time at least 100MHz above the base 3.5GHz clock. That ain't too shabby and worth a tiny hit compared to the quad's likely single threaded performance to get two more cores IMHO.

ETA: Did you test under Windows as well, to see if OS X might be doing something behind the scenes to mess with Turbo?

Derpage · Jan 10, 2014

Looks similar to when a chip isn't getting enough offset voltage thrown at it while stepping up or down.

Nugget · Jan 11, 2014

Perhaps a more meaningful test would be to measure the completion time of a fixed piece of computing work and then use that to derive the presumed CPU frequency simply by stopwatch measurements. A purely CPU-bound computation that scales linearly by clock rate might end up being a better way to measure turbo boost efficacy than just polling the reported clock rate every second.

BOINC or distributed.net or maybe the mersenne prime project client, something that's just pure math.

VirtualRain · Jan 11, 2014

Here's an interesting article at Anand that would imply Apple's firmware plays a significant role in how Turbo Boost behaves...

http://www.anandtech.com/show/6214/multicore*************the-debate-about-free-mhz

However this technology is not defined by the processor itself. The act of telling the processor to run at a certain speed is set by the motherboard, not the processor. So as part of the deal with Intel, motherboard manufacturers code in the BIOS the algorithm to make the CPU switch speeds as required. This algorithm can be aggressive, such that turbo boosts are held for a short time when CPU loading goes from low to high, or instant when CPU power is needed or not needed. This algorithm and switching speed can determine how well a motherboard performs in CPU benchmarks.

echoout · Jan 11, 2014

Nugget said:
Perhaps a more meaningful test would be to measure the completion time of a fixed piece of computing work and then use that to derive the presumed CPU frequency simply by stopwatch measurements. A purely CPU-bound computation that scales linearly by clock rate might end up being a better way to measure turbo boost efficacy than just polling the reported clock rate every second.

BOINC or distributed.net or maybe the mersenne prime project client, something that's just pure math.

Like the real-time MPG versus the average MPG readings in many cars. That makes sense.

sirio76 · Jan 11, 2014

VirtualRain said:
You simply can't say that the 8-core or 6-core will have similar clocks to the 4-core on lightly threaded tasks. It's highly likely that the only CPU that will see 3.8 or 3.9GHz is the Quad. And it's entirely possible the 8-Core will never go higher than 3.5GHz.

Looking at this benchmark: http://www.primatelabs.com/blog/2013/11/estimating-mac-pro-performance/
seems that the 8core beats both the 4 and 6core in single thread performance, as if the turbo were working better.
That's the reason why I'm going for the 8core version. I do not necessarily need many cores on my main workstation (I have a small farm for multithreaded software that offers much better price/performance than any dual Xeon ws), but single core performance is very important while working.

Tutor · Jan 11, 2014

I like [G5]Hydra's take on the overall situation.

VirtualRain said:
LOL... thanks... I guess it is what it is then.

EDIT: I tried a bunch of benchmarks at once... of course, the CPU was pegged at 3.6GHz the whole time, which is what you'd expect for this fully loaded 6-core.

I have an odd way of viewing increase. That's why I suggested that you load up all of the cores with multiple tasks. To me the sweetest point is Turbo stage 1. Turbo stage 2's sweetness is questionable/dubious. The least sweet turbo point is Turbo stage 3. And that's the way I've always viewed Turbo. As the core counts increase, the sweet point moves closer towards no-turbo speed, unless there is hardly any drop off in the number of cores that participate at each higher stage and the stage level differences are significant.

For example, if you have a 6-core running at 3.5 GHz at base and Turbo stage 1 is 3.6 GHz and every one of those cores participates at stage 1, to me you've gained 6x100 MHz, or 1.029x per core over their base. If the top level (Turbo stage 3) is 3.9 GHz and only one core is participating at a time at that level, then all you've gained is 4x100 MHz, and for that one core's bounty of 1.114x over its base, the other cores take breathers in between each of their potential substitutions for that one. To me you could be paying a big price in overall performance for the 3.9 GHz bragging right.

So having gotten into this crevice of my mind, it probably now comes as no surprise to you why I'd consider stage 2's sweetness dubious. You could get 4-core participation at 3.7 (which would be 4x200 MHz, but 4 cores working simultaneously is less than 6 cores working simultaneously) or you could get 2-core participation at 3.7 (which would be 2x200 MHz, but 2 cores working simultaneously is less than 6 cores working simultaneously). So if, like me, you need the power to do multiple taxing things simultaneously, then you are the winner, even though I know that you wanted to, but didn't, see 3.9 GHz.

Check out this article when you get some free time [ http://www.computerwoche.de/fileserver/idgwpcw/files/1595.pdf ]. Among other things, it confirms what I've experienced and read elsewhere: " ... the Turbo frequency [stage reached is] based on power, temperature and current." Thus, to the extent you keep your cores cool (and if you could underclock them to lower their current draw and help keep their wattage below TDP), you can increase the chances of getting to those higher stages, if that is where you truly want to go.
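The arithmetic in that comparison can be written out as a short sketch. It assumes the 3.5GHz base and the stage clocks discussed in this thread, and it treats "cores x clock" as a crude stand-in for total throughput, which ignores memory and scheduling effects. The function names are made up for illustration.

```python
BASE_GHZ = 3.5


def aggregate_gain_mhz(active_cores, turbo_ghz, base_ghz=BASE_GHZ):
    """Total MHz gained over base, summed across the cores that are boosting."""
    return round(active_cores * (turbo_ghz - base_ghz) * 1000)


def per_core_ratio(turbo_ghz, base_ghz=BASE_GHZ):
    """Speed-up of a single boosted core relative to base."""
    return turbo_ghz / base_ghz


def total_ghz(active_cores, turbo_ghz):
    """Crude throughput proxy: number of busy cores times their clock."""
    return active_cores * turbo_ghz


# (label, active cores, clock in GHz)
scenarios = [
    ("stage 1, 6 cores", 6, 3.6),
    ("stage 2, 4 cores", 4, 3.7),
    ("stage 2, 2 cores", 2, 3.7),
    ("stage 3, 1 core", 1, 3.9),
]

for label, cores, ghz in scenarios:
    print(
        f"{label}: +{aggregate_gain_mhz(cores, ghz)} MHz total, "
        f"{per_core_ratio(ghz):.3f}x per core, "
        f"{total_ghz(cores, ghz):.1f} core-GHz busy"
    )
```

Six cores at 3.6GHz come out at 21.6 core-GHz against 14.8 for four cores at 3.7GHz and 3.9 for a single core at 3.9GHz, which is the point about multi-tasking workloads made above.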

[G5]Hydra · Jan 11, 2014

sirio76 said:
Looking at this benchmark: http://www.primatelabs.com/blog/2013/11/estimating-mac-pro-performance/
seems that the 8core beats both the 4 and 6core in single thread performance, as if the turbo were working better.
That's the reason why I'm going for the 8core version. I do not necessarily need many cores on my main workstation (I have a small farm for multithreaded software that offers much better price/performance than any dual Xeon ws), but single core performance is very important while working.

Don't forget the 8 core also has the best core/L3 cache ratio in the MP range. It has over double the L3 of the hex, 25MB vs. 12MB (even the mighty 12 core beast has "only" 5MB more than the 8 core). For some tasks it might make a nice difference and could very well help give the 8 core better performance than you'd otherwise expect given the speeds. It isn't a mistake that the 8 core is such a leap up in cost over the quad and hex chips; Intel/Apple are making you pay for that massive L3.

iBug2 · Jan 11, 2014

I'd say the best way to compare would be to take a 4 core and a 6 core, throw at them the exact same single threaded task and see which finishes first. If the 4 core finishes considerably faster that means the turbo on the 6 core isn't really going as high as the turbo on the 4 core.

haravikk · Jan 11, 2014

I think to see the full single core turbo-boost you need to make sure the total running processes won't add up to more than 100% CPU load, otherwise processes will need to run on more cores. So you'd be more likely to see the maximum turbo boost when your workload is limited to say, 95% load, for example.

There are also other factors. For example, while pushing all processes onto a single, faster-running core might seem like a good idea in theory, it isn't necessarily the best thing to do: the cost of swapping processes in and out to give each some CPU time is still fairly high. Keeping them on separate cores can be better overall, as you need less switching per core, which means less wasted time. OS X can have a pretty huge number of threads even when idle, each of which needs some CPU time now and then, and even sleeping threads may wake up periodically. The end result is that it can be very difficult to set up a situation where you have few enough processes that running them on a single, faster core is better than spreading them out to reduce the time wasted switching.

Of course we also don't know enough about how OS X decides how to allocate processes to cores.

So yeah, not sure you're likely to really see the "fastest" turbo boost setting. Maybe in future the technology will get better at running one core at full speed, with the others still available for idle/background processes, and OS X will take full advantage of it, but it's just one of those things that in reality you won't see a huge benefit from.

It'd be interesting to see the same tests on a 12-core CPU, as while its maximum turbo boost speed is lower, it may still spend most of its time at the same speeds. You'd hope so anyway, given how much it costs.

deconstruct60 · Jan 11, 2014

VirtualRain said:
I get what you're saying... I mean looking at Activity Monitor at idle, I have over 700 threads active from 162 processes, but CPU utilization is less than 1%.

There are two reasons to chop speed: thermal overload and nothing to do. The latter occurs a lot more than folks think. CPU clocks are so much faster than storage and even RAM that at times there is nothing to do. That is exactly why simultaneous multithreading (SMT, what Intel calls Hyper-Threading) works. Function units are stalled waiting on data, so the CPU can weave in instructions from other threads with little slowdown.

The notion that Turbo is "hands-free overclocking" is flawed. Heavy-handed overclocking is excessively wasteful. It's great for 'Top Fuel' drag racing, but as a steady operating state it is kind of goofy from an efficiency perspective. When there is a bit less workload, 'taking your foot off the gas' just gives the thermal system that much more headroom later if it needs to change to a higher level. Loading down the system with 'no-op' execution is goofy.

So two things are needed to get to max Turbo: thermal headroom and a workload that needs to get done. If you don't have both, there is little good reason to ramp the clock up to max.

Even Geekbench (a CPU-centric benchmark) is going to have bandwidth issues unless it uses quirky measurements that fit in L2/L3 cache practically all the time (after the initial load).

deconstruct60 · Jan 11, 2014

haravikk said:
I think to see the full single core turbo-boost you need to make sure the total running processes won't add up to more than 100% CPU load, otherwise processes will need to run on more cores. So you'd be more likely to see the maximum turbo boost when your workload is limited to say, 95% load, for example.

And a workload which has data that is pulled as high as possible into the memory hierarchy. For example, some kind of encryption stream that uses the E5's hardware encryption function units and can simply read/write the results stream from memory. That would max out the RAM throughput while still providing work to do on the data being dragged in/out.

If there are still computation delay slots, pick another computation thread that doesn't use the crypto units at all (integer or float math?) and run a tight loop on that with a limited, L3-sized data set.

It is going to be a corner case workload that doesn't resemble normal workloads.

joema2 · Jan 11, 2014

deconstruct60 said:
...Operating systems try to pick which process to run where very quickly so they can let apps get back to what they were doing. That can mean they don't have time to consider every single possible option...

That is correct. Historically the thread/process scheduling frequency is called "time slice interval" or something similar. A programmable hardware real time clock generates an interrupt which halts processing and transfers control to the scheduler. I don't know the interval for OS X but typically it's rapid -- on the order of milliseconds.

For this reason OS thread schedulers usually have simple logic. Overhead would be too large for complex logic. Thread/processes are usually assigned a priority and a state, and the algorithm is usually "highest priority runnable thread gets run".
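As a toy illustration of how little logic "highest priority runnable thread gets run" needs, here is a minimal sketch using a heap. This is not how OS X's scheduler is implemented; the class, thread names and priority values are all made up, and a real scheduler also handles time slices, blocking and multiple cores.

```python
import heapq


class ToyScheduler:
    """Picks the highest-priority runnable thread; ties go to the oldest entry."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # insertion counter used to break priority ties

    def make_runnable(self, name, priority):
        # heapq is a min-heap, so negate priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, self._seq, name))
        self._seq += 1

    def pick_next(self):
        """Return the next thread to run, or None if nothing is runnable."""
        if not self._heap:
            return None
        _, _, name = heapq.heappop(self._heap)
        return name


sched = ToyScheduler()
sched.make_runnable("background-indexer", priority=4)
sched.make_runnable("benchmark-thread", priority=31)
sched.make_runnable("ui-event-loop", priority=47)

order = [sched.pick_next() for _ in range(4)]
print(order)
```

Each decision is a single heap pop, which fits the point above that the scheduler cannot afford to weigh every possible placement.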

Even that simple algorithm sometimes imposes too much overhead, and for this reason some specialized databases internally implement their own thread scheduling algorithm.

In more recent years some OSs have a "class based" or "quota based" scheduler, whereby in addition to priority and state, each process is assigned a max CPU quota or grouped into an assigned class. This allows setting limits to constrain "CPU hog" processes, but involves even more overhead.

deconstruct60 · Jan 11, 2014

sirio76 said:
Looking at this benchmark: http://www.primatelabs.com/blog/2013/11/estimating-mac-pro-performance/
seems that the 8core beats both the 4 and 6core in single thread performance, as if the turbo were working better.

Unfortunately, the forum pretty printer for that link clips the full name. The last part of that is "/estimating-mac-pro-performance/". That isn't a benchmark. A benchmark is a measurement, not an estimate.

Outside the scope where folks terminate all other apps to generate the highest drag racing score the results will be somewhat different in actual real world usage. Most normal workloads have multiple apps open at a time.

http://browser.primatelabs.com/geekbench3/search?utf8=✓&q=MacPro6,1++Mac+Pro++2013+

The actual results are close to the estimates, but the hierarchy is not quite the same.

----------

[G5]Hydra;18625349 said:
.... It isn't a mistake that the 8 core is such a leap up in cost over the quad and hex chips; Intel/Apple are making you pay for that massive L3.

The 8 core is really a 10 core 2600 'hot rodded' with two cores disabled in order to boost base and turbo clock rates. Similar to the gimmick of disabling cores to get a more consistent Turbo state, only it is set that way from the factory (and the QPI links are disabled). There is a 10 core, not quite as highly clocked, twin QPI model with the same price. Yes, L3 is a factor, but there are other contributors too.

The larger L3 is somewhat of an offset to the increased memory pressure that 8 high clocked cores are going to put on the memory subsystem. Single core hot rodding with single core sized L3 data will get a larger boost but modern file sizes and data sets are larger.
