Where do the benefits come from?

I'd like to see figures for the independent contributions of:
- Snow Leopard running non-optimised code
- SL with OpenCL
- SL with GCD
- SL with both optimisations

Wonder if there's more detailed information anywhere?
 
I've never seen an answer to this type of question, not even a ballpark figure....

Has anyone read what kind of speedup one might expect for H.264 video ENcoding using OpenCL on a Mac Pro maxed out with the standard-class video cards Apple offers?

Just a broad speedup figure. 10x? 100x?

Or is H.264 encoding not parallelizable enough to actually see much of a boost?
CUDA alternatives I have tried on my MacBook Pro with 2.4GHz/8600M GT go from 20fps to 150fps for 720p.


http://badaboomit.com/ is what I use. They claim 20x.
 
So the real question is: what version is this, and where can I get it?

Also, RIGHT HERE and now we should start a list of Snow Leopard apps that support OpenCL and Grand Central. I have said all along that these two features are the most important things in SL for performance... and I hold Apple to task, and ridicule, for not releasing software that uses their own technology. Come on, when major parts of the OS don't use it and FCS doesn't use it... for shame, Apple, for shame. Face palm.....
 
What's GCD like on dual-core Macs that can't use OpenCL?

I object to Apple not making OpenCL drivers for these machines because I think they would benefit the MOST.

Theoretical situation...

8-Core Mac Pro... video encoding... 4 seconds (Leopard)
8-Core Mac Pro... video encoding... 2 seconds (Snow Leopard)

2-Core iMac/MacBook Pro... video encoding... 10 minutes... (Leopard)
2-Core iMac/MacBook Pro... video encoding... 10 minutes... (Snow Leopard)
2-Core iMac/MacBook Pro... video encoding... 5 minutes... (Snow Leopard if they made a damn driver!)

We're talking 1.5-2 year old computers here, and since Apple's ripped out ANY support for PPC machines, one would think they could support the last 4 years of Intel machines... one would think?

Here we're also talking about users who potentially have less money and can't afford to upgrade their hardware as often, so they do so every 5 years or so. These guys still pay to upgrade the software and are interested in new technology... they also want the biggest bang for their buck.

So Mac Pro users... 4 seconds or 2 seconds... who cares? Me... 10 minutes or 5 minutes... I care, because it's a lot of time! And for those who say "you're a cheap bastard"... my MBP cost significantly more than most Mac Pros!
 
Let's be clear here: Grand Central Dispatch does not itself bring any performance improvements. It is just a library that simplifies threading for developers who might not otherwise do multi-threaded programming. It's the multi-threading that brings the performance improvements.

Grand Central does have some features you can't just implement inside an application in Leopard. Grand Central manages a threadpool system-wide, which allows threadpools to be used very cheaply (CPU-cost-wise) and allows the system to maintain an optimal number of worker threads for the CPU cores available.

For large applications the overhead of creating a threadpool is minimal, but for smaller applications, or applications that may not need a threadpool often, cheap threadpools are a substantial win. This article is about .NET threadpools, but Figure 2 at http://msdn.microsoft.com/en-us/magazine/dd252943.aspx illustrates the problem of maintaining an optimal number of concurrent threads. Too few threads per core and performance plummets. Too many threads per core and performance starts heading south again. I believe OS X is the first OS to integrate global threadpools.

If you are running only one large application, then Grand Central might not give you an advantage. Grand Central is pretty neat in how it automatically allocates resources without overwhelming the computer.
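
To make this concrete, here is roughly what tapping the system-wide pool looks like from C with the libdispatch API (a minimal sketch; the work items are just placeholder printfs):

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void) {
        /* The global concurrent queue is backed by the system-wide pool
           described above; GCD sizes its worker threads to the machine's
           cores and current load, so the program never creates a thread. */
        dispatch_queue_t queue =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* dispatch_apply submits 16 blocks and waits for all of them;
           GCD decides how many actually run at once. */
        dispatch_apply(16, queue, ^(size_t i) {
            printf("work item %zu done\n", i);
        });
        return 0;
    }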
 
What's GCD like on dual-core Macs that can't use OpenCL?

I object to Apple not making OpenCL drivers for these machines because I think they would benefit the MOST.
These cards were sold to you for rendering video... they just aren't capable of doing the calculations. It was normal for a graphics card's engine not to support double-precision floating point, or to bastardize IEEE 754 floating-point numbers; they were made to render graphics fast. If you wanted to be able to do math calculations on your card, then you should have made sure to get a CUDA-capable graphics card.

There isn't anything anyone can do... as far as I can see, there are valid reasons for every unsupported graphics chipset, which is most of them.
 
Grand Central does have some features you can't just implement inside an application in Leopard. Grand Central manages a threadpool system-wide, which allows threadpools to be used very cheaply (CPU-cost-wise) and allows the system to maintain an optimal number of worker threads for the CPU cores available.

For large applications the overhead of creating a threadpool is minimal, but for smaller applications, or applications that may not need a threadpool often, cheap threadpools are a substantial win.
Why in the world would I want to create a thread pool when I don't need one!?!?!? Please give just ONE example of why a thread pool would be needed in a desktop application that does not provide server services or benefit from significant parallelism in a complex operation.
This article is about .NET threadpools, but Figure 2 at http://msdn.microsoft.com/en-us/magazine/dd252943.aspx illustrates the problem of maintaining an optimal number of concurrent threads. Too few threads per core and performance plummets. Too many threads per core and performance starts heading south again. I believe OS X is the first OS to integrate global threadpools.
The article you linked has almost nothing to do with global thread pools. Please state just ONE specific advantage global thread pools provide over one I'd create locally (for a desktop application).
If you are running only one large application, then Grand Central might not give you an advantage. Grand Central is pretty neat in how it automatically allocates resources without overwhelming the computer.
What are you calling a "resource" here? Cores? Does GCD have any knowledge of other resources on the system? How does GCD know what system resources my thread needs and when it will need them? Is it really that smart?
 
CUDA alternatives I have tried on my MacBook Pro with 2.4GHz/8600M GT go from 20fps to 150fps for 720p.


http://badaboomit.com/ is what I use. They claim 20x.

Personally, I am not interested in what somebody can do on their late-model 8-core Mac Pro unless they already have a room full of 10 machines and the software lets ten guys get to the pub an hour earlier every day.

I am far more interested in what it does for the lowest-end hardware that can take advantage of its full benefits, such as a Mac Pro with dual graphics processors. Hopefully it lets relatively low-level hardware simply do things not possible before.

Heck, the 2009 iPod touch (32/64 GB) has OpenGL ES now, and enough heat budget that the chip's speed is crippled 50% less.

What interests me most about these technologies is seeing them hit the lowest of the low: iPhone, polycarbonate MacBook, Mac mini, and whatever tablet is coming out (though that will have dual cores). :)

Rocketman
 
Why in the world would I want to create a thread pool when I don't need one!?!?!? Please give just ONE example of why a thread pool would be needed in a desktop application that does not provide server services or benefit from significant parallelism in a complex operation.
If your UI stalls waiting for a task to complete, you can run the task separately while the UI keeps chugging along, leaving the user free to do other things in the UI while waiting. For example, when a tab stalls in Safari, wouldn't it be great to be able to switch to another tab and get some work done? Also, you are not creating the thread pool; GCD takes care of that. You just specify how tasks are split (arguably the hard part of multi-core programming) and how they interact, and GCD manages it all. Kinda like how you don't have to micromanage all the trains at the station when the workers there do it for you.
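
Here's a minimal sketch of that pattern with the libdispatch C API; render_thumbnail and update_view are made-up stand-ins for the slow task and the UI refresh:

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    /* Hypothetical stand-ins for the slow task and the UI update. */
    static void render_thumbnail(void) { puts("rendering..."); }
    static void update_view(void)      { puts("updating UI"); }

    static void on_button_click(void) {
        /* Push the slow work off the main thread... */
        dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
            render_thumbnail();
            /* ...then hop back to the main queue, where UI updates belong. */
            dispatch_async(dispatch_get_main_queue(), ^{
                update_view();
            });
        });
    }

    int main(void) {
        on_button_click();
        dispatch_main();   /* services the main queue; never returns */
    }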

The article you linked has almost nothing to do with global thread pools. Please state just ONE specific advantage global thread pools provide over one I'd create locally.
Global thread pools (as used by GCD) know the state of the system. This is a specific example of the advantage of GCD: it will optimize the number of threads to reflect system load.

What are you calling a "resource" here? Cores? Does GCD have any knowledge of other resources on the system? How does GCD know what system resources my thread needs and when it will need them? Is it really that smart?
Yes. Actually it's not asking you to think that low level. GCD says, "Give me a task that can be broken up into blocks, and I'll figure out the best way to do it given current system resources." Namely CPU and GPU. Memory is already managed in a preemptive multitasking system. Why rewrite the memory manager when you have a good one already?

At least that's how I understood it.
 
@2002cbr600f4i

Thank you for that. I know nothing about programming or these technologies but that has helped me understand why this is important. ;)
 
Why in the world would I want to create a thread pool when I don't need one!?!?!? Please give just ONE example of why a thread pool would be needed in a desktop application that does not provide server services or benefit from significant parallelism in a complex operation.

The article you linked has almost nothing to do with global thread pools. Please state just ONE specific advantage global thread pools provide over one I'd create locally (for a desktop application).

What are you calling a "resource" here? Cores? Does GCD have any knowledge of other resources on the system? How does GCD know what system resources my thread needs and when it will need them? Is it really that smart?

You like to bluster a lot, but you apparently don't read up on what you're talking about before you open your yap.

Yes, GCD is that smart. That's the entire point of the exercise: effectively coordinating the dozen or more user-space programs that are generally running at any given time to use threads between them most efficiently, so that each is maximized without stepping on the others.

A side effect of this global subsystem for thread dispatch and management (whoever said they didn't see any management in GCD is apparently totally ignorant of it) is that the mechanics of threading are now extremely simple. That's a win inasmuch as it encourages developers to thread tasks that lend themselves to threading but maybe weren't worth the human overhead in the past.

People can say "that was already the easy part!", but it doesn't matter that it was easy: it was time-consuming, detail-intensive, and boring, which means it only happened when there was a very clear win.

Or rather, it's trying to be that smart; how effective it is, broadly, remains to be seen.
 
Yes. Actually it's not asking you to think that low level. GCD says, "Give me a task that can be broken up into blocks, and I'll figure out the best way to do it given current system resources." Namely CPU and GPU.

Not quite. It is still up to the programmer to determine what tasks, if any, can be broken into blocks and run in parallel. They have to determine what data is shared, which orderings of operations matter, dependencies, etc... This is what is hard about parallel programming (really, creating threads has been simple since Java/C#; what you do with them is the hard part). Also, many programs, especially GUI ones, have few operations that can run in parallel. Of course there are exceptions, but the big win for GCD will come from applications that are most likely already written to run in parallel being able to fully utilize any machine they are on, because of the global pool/queue/dispatch, or whatever they want to call it.
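
To illustrate the part that stays with the programmer, here is a small sketch of deciding what is independent and what is shared; the queue label and the toy arithmetic are invented, and GCD only supplies the queues and the pool:

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void) {
        /* A private serial queue guards the one piece of shared state,
           so updates happen in order without an explicit lock. */
        dispatch_queue_t guard = dispatch_queue_create("example.total", NULL);
        __block long total = 0;

        dispatch_apply(100,
                       dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                       ^(size_t i) {
            long partial = (long)(i * i);                 /* independent work */
            dispatch_sync(guard, ^{ total += partial; }); /* shared state */
        });

        printf("total = %ld\n", total);
        dispatch_release(guard);   /* manual release, Snow Leopard era */
        return 0;
    }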
 
Why in the world would I want to create a thread pool when I don't need one!?!?!? Please give just ONE example of why a thread pool would be needed in a desktop application that does not provide server services or benefit from significant parallelism in a complex operation.

The article you linked has almost nothing to do with global thread pools. Please state just ONE specific advantage global thread pools provide over one I'd create locally (for a desktop application).

What are you calling a "resource" here? Cores? Does GCD have any knowledge of other resources on the system? How does GCD know what system resources my thread needs and when it will need them? Is it really that smart?

Actually, you're ripping on THE big feature of GCD. I remember when I first started using Linux back in the 90s. When you compiled your own kernel, you could pass a -j option to make to spawn N parallel jobs. IIRC, the ideal number of jobs was the number of processors + 1, IF your machine was only going to compile. If it was doing other stuff at the same time, maybe you'd knock it down to a single job.

I give that example to show why the global queues are a good thing. Programs that need threading can now ask for and use as many queues as they want, and GCD will dynamically raise and lower the actual number of threads based on cores and load. The programmer doesn't have to guess, and the user doesn't have to set anything.

That is the cool part about GCD. It will not suddenly make all programs more responsive, or make everything threadable, or make every programmer able to write parallel code. What it will do is allow programs that have potentially parallel sections to fully utilize the hardware they are on.
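
To make the queues point concrete, a sketch (the queue names and tasks are invented for illustration): the program asks for as many queues as its structure suggests, and GCD alone decides how many real threads back them.

    #include <dispatch/dispatch.h>
    #include <stdio.h>

    int main(void) {
        /* Ask for as many queues as the program's structure suggests... */
        dispatch_queue_t net   = dispatch_queue_create("app.network", NULL);
        dispatch_queue_t disk  = dispatch_queue_create("app.disk",    NULL);
        dispatch_queue_t parse = dispatch_queue_create("app.parse",   NULL);

        /* ...GCD multiplexes them all onto one system-managed pool,
           raising and lowering the real thread count with load. */
        dispatch_group_t group = dispatch_group_create();
        dispatch_group_async(group, net,   ^{ puts("fetching"); });
        dispatch_group_async(group, disk,  ^{ puts("caching");  });
        dispatch_group_async(group, parse, ^{ puts("parsing");  });

        dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
        dispatch_release(group);
        dispatch_release(net);
        dispatch_release(disk);
        dispatch_release(parse);
        return 0;
    }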
 
Why in the world would I want to create a thread pool when I don't need one!?!?!? Please give just ONE example of why a thread pool would be needed in a desktop application that does not provide server services or benefit from significant parallelism in a complex operation.

How about 'ls'? Anything that can be split up into concurrent operations. Regular threadpools are too expensive to set up to be worth using in most situations.

The article you linked has almost nothing to do with global thread pools.

The number of concurrent threads is... a global problem for a system. The chart shows the number of concurrent threads on a system vs. the amount of work performed. Having multiple local threadpools that don't know about each other means that you have a LOT more than the optimal number of threads running concurrently.

Please state just ONE specific advantage global thread pools provide over one I'd create locally (for a desktop application).

I listed two: cheap threadpools, and allowing the system to maintain an optimal number of worker threads. If you need a specific usage scenario: how about using a threadpool for video encoding while running a separate app for video editing (that also uses threadpools)? If both apps had local threadpools that were optimized for the number of CPUs on the system, then you would just be wasting resources.

What are you calling a "resource" here? Cores?

Cores and memory. Extra threads underutilize cores by forcing more context switching, and they eat up more memory.

Does GCD have any knowledge of other resources on the system? How does GCD know what system resources my thread needs and when it will need them? Is it really that smart?

Yes... how could it not? GCD manages the threadpool for all GCD apps.
 
I'll be polite, but this analysis is sorely lacking in understanding of threading mechanisms....

"Idle" threads don't "lose time".
I'm a bloody programmer; I didn't say "idle" threads, I said threads that barely do anything. There's a difference. The cost of a context switch just to perform a trivial operation can be huge, and in user interfaces and many other common cases you get a lot of these types of operations.

Threads handling network connections that aren't up to much can likewise add to the thread clutter in a system, which results in crazy amounts of context switching just to give threads a chance to run when they're not really needed. If instead a single thread ran some of these lightweight tasks, there would be almost no cost to them; but when applications dedicate threads to these tasks, processor time can be lost to thread management as soon as you start opening a handful of internet-connected programs or other lightweight apps.
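
For what it's worth, this is exactly the case GCD's dispatch sources are aimed at: the handler only borrows a thread from the pool when there is actually data to process. A sketch, watching stdin as a stand-in for a socket:

    #include <dispatch/dispatch.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        dispatch_queue_t q =
            dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

        /* A read source fires the handler only when data is available,
           so no thread sits blocked on the descriptor in the meantime. */
        dispatch_source_t src =
            dispatch_source_create(DISPATCH_SOURCE_TYPE_READ, STDIN_FILENO, 0, q);
        dispatch_source_set_event_handler(src, ^{
            char buf[512];
            ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
            printf("handled %zd bytes with no dedicated thread\n", n);
        });
        dispatch_resume(src);   /* sources are created suspended */

        dispatch_main();        /* park main; handlers run on the pool */
    }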
 
CUDA alternatives I have tried on my MacBook Pro with 2.4GHz/8600M GT go from 20fps to 150fps for 720p.


http://badaboomit.com/ is what I use. They claim 20x.

I use Badaboom for my iPhone encodes. It is fast, but the picture is horrible!! The CUDA cards are nice, but the end product is not as good. My wife barely notices the difference from 480p to 720p on our 50" screen. She made me turn off the movie I encoded with Badaboom to 720p. It looks blocky as hell!! Don't even try 1080p with that crap.
 
I have to disagree, as I believe Apple refactored NSOperation and related code to run on top of the new threading architecture. It is the only viable explanation I have for some of the speedups seen in older code. We may be saying the same thing, but I'm specifically saying that the infrastructure put in place in SL to support GCD has had an excellent impact on existing software. Well, existing software that took advantage of Apple's NS threading primitives.

Well, yeah, a rewrite can always speed things up; I'm just saying there has been a positive impact on existing software, and that has a lot to do with the infrastructure put in place for GCD.

Dave

Yup Dave, we're saying the same thing... It's not that the app has gotten faster just because of GCD. It's that the OS support that the App makes use of has gotten faster because it's been rewritten to use GCD. So, yes, SL can run some things faster without apps having to be rewritten. There is SOME benefit. We won't see the rest of the benefits until the apps are modified to directly use GCD themselves.
 
Handbrake and Creative Suite rewrites to take advantage of these technologies are the biggest items on my wishlist.
 
I have to object.

The tricky thing is that the new technology underlying Snow Leopard is not going to make a difference straight away, but needs to be implemented by developers.
I will have to continue to object to this idea; well-written apps already benefit from SL, from what I can see. In some cases I seriously doubt the developers will do anything more to optimize specifically for SL.

Obviously there are programs that can be re-engineered to benefit from GCD, blocks, and the low-level features of the platform. How widely this is the case really isn't well known at the moment.
Those who read the Ars Technica review of Snow Leopard might believe the reviewer that Grand Central technology is relatively easy to implement in code, because rather than creating threads on your own, one can basically hand things over to GCD and it will do the rest: balancing CPU power between processes and cores and all those technical things.
Well, it's been a while since I read that article, but I don't remember it saying that. Yes, GCD does the load balancing, but it is still up to the programmer to find an optimal way to parallelize the algorithms being used. So while the details of handling micro-threads are not an issue, there is no change in the effort required to find the parallel code.
Well, anyhow, it doesn't matter too much now; we will see benefits here over time though, be patient.

Well, this is certainly true! It will be interesting to see which apps adopt GCD heavily, and more so which apps go one step further and adopt OpenCL.



Dave
 
We're talking 1.5-2 year old computers here, and since Apple's ripped out ANY support for PPC machines, one would think they could support the last 4 years of Intel machines... one would think?
I guess you're making the mistake of thinking they dropped PPC to save time. No, they dropped PPC to screw customers. And that's what they're doing to you, too. This won't stop until customers stop excusing Apple for this kind of behavior. Of course, they dropped 'Computer' from their name because they make toys now.

Here we're also talking about users who potentially have less money and can't afford to upgrade their hardware as often, so they do so every 5 years or so. These guys still pay to upgrade the software and are interested in new technology... they also want the biggest bang for their buck.
That makes you a poor customer. :D It used to be that you bought a Mac and got a long life out of it. Now you buy a Mac, and then you buy another one within two years, or you are a worthless Apple person.

So Mac Pro users... 4 seconds or 2 seconds... who cares? Me... 10 minutes or 5 minutes... I care, because it's a lot of time! And for those who say "you're a cheap bastard"... my MBP cost significantly more than most Mac Pros!
Next time, don't give Apple your money unless they promise you a certain number of years of support.
 
"Some" can be rather significant.

Yup Dave, we're saying the same thing... It's not that the app has gotten faster just because of GCD. It's that the OS support that the App makes use of has gotten faster because it's been rewritten to use GCD. So, yes, SL can run some things faster without apps having to be rewritten.
This really appears to be a very good thing for apps making use of Apple's higher-level threading APIs. For people programming with those APIs there may be little incentive to go down to the level of the GCD primitives.
There is SOME benefit. We won't see the rest of the benefits until the apps are modified to directly use GCD themselves.
If they are modified. The thing is, if your app is 20 to 50% faster using the high-level Cocoa threading features, then maybe the developer will spend his time on other parts of the app.

What it comes down to is whether the incentive is always there for a developer to drop down to the low-level nitty-gritty of GCD. Sometimes the answer will be no, simply because Apple has improved the high-level routines, freeing the developer up to work on other parts of the program.

I realize that many programs will need to be recoded for GCD and that we will see impressive results; that is a given for programs that can be heavily parallelized. On the other hand, the idea some have in this thread, that all refactored programs will suddenly run much faster than today's versions, needs some cooling. Here I'm not talking so much to you as to the individuals who seem to believe that a year from now we will all be shocked by how fast some apps run. For some apps you won't see much more than the speed increases we currently see.


Dave
 
I'd like to see figures for the independent contributions of:
- Snow Leopard running non-optimised code
- SL with OpenCL
- SL with GCD
- SL with both optimisations

Wonder if there's more detailed information anywhere?

Well, I think that would make for a GREAT test/example. The problem is coming up with some sort of algorithm/test problem that could be applied across the board like that....

Something like:

1) Perform the operation across a huge array of data, linearly.
2) Do #1, but by making your own threads to do it in parallel.
3) Do it again, but using the GCD features instead of your own threads.
4) Do it again, but using OpenCL.
5) Do it again, but using some combination of #3 and #4.

Note: OpenCL will run any code sent to it on your CPU if there isn't a capable video card in your system, and when it does so, the CPU's OpenCL driver uses blocks and GCD's queues to do the processing. If the code is executed on the GPU, it uses whatever facilities the GPU hardware and driver provide.

If anyone can come up with an interesting/compelling test like that, I'd love to take a crack at coding up 1, 3, 4, and 5. (I've never done manual threads in Objective-C on OS X, so I have no idea how to write #2.)
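
To get the ball rolling, here's a bare-bones skeleton of variants 1 and 3 in C with libdispatch; the sqrtf kernel, the array size, and the stripe count are arbitrary placeholders, and the timing code is left out:

    #include <dispatch/dispatch.h>
    #include <math.h>
    #include <stdio.h>

    #define N (1 << 20)
    static float input[N], output[N];

    /* Variant 1: one linear pass over the array. */
    static void variant1(void) {
        for (size_t i = 0; i < N; i++)
            output[i] = sqrtf(input[i]) * 2.0f;
    }

    /* Variant 3: the same pass, striped and handed to GCD. */
    static void variant3(void) {
        const size_t stripes = 8;   /* arbitrary chunking choice */
        const size_t per = N / stripes;
        dispatch_apply(stripes,
                       dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0),
                       ^(size_t s) {
            for (size_t i = s * per; i < (s + 1) * per; i++)
                output[i] = sqrtf(input[i]) * 2.0f;
        });
    }

    int main(void) {
        variant1();
        variant3();
        printf("spot check: %f\n", output[N - 1]);
        return 0;
    }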
 
On the other hand, the idea some have in this thread, that all refactored programs will suddenly run much faster than today's versions, needs some cooling. Here I'm not talking so much to you as to the individuals who seem to believe that a year from now we will all be shocked by how fast some apps run. For some apps you won't see much more than the speed increases we currently see.


Dave

Agreed! Nothing has annoyed me more since SL's release than seeing all these people on the boards scream "SL didn't make all my programs run faster! Why isn't OpenCL doing anything on my system? SL is nothing more than a Service Pack!" and so on.

I chalk that up to them just not knowing how the underlying stuff really works. The changes made to SL by adding GCD + OpenCL have the POTENTIAL to make the system faster (if for no reason other than some of the underlying OS code being able to use them to speed some things up, like Core Image getting a 25% boost from using OpenCL), but it's not a magic bullet where installing SL suddenly makes everything go 50% faster, like some people think.

No, these changes introduced in SL provide DEVELOPERS with tools that can help make their applications faster, where applicable, and by using those features, in turn make the overall system more responsive and better able to make use of all the hardware available (rather than having half your cores sitting around doing nothing while the other half are maxed out). Again, that's an ideal situation, but I think as we start to see more apps converted to use these features, we'll see hardware utilization improve, especially when running multiple applications simultaneously, and the OS will remain responsive even in those situations.

Yes, there's nothing here (at least with GCD) that good developers couldn't already do by creating their own threads in their programs. These tools just make it easier, so it's not as much of a headache to use a multithreaded approach where possible, rather than saying "I don't want to deal with writing this multithreaded, even though this is a good place to do it!" So, yeah, existing well-written multithreaded programs probably won't see much of a boost from GCD. But if the tools help other developers make their programs more multithreaded, then it shouldn't hurt at all.
 