Grand Central Dispatch and Open CL Bring Significant Performance Improvements for Optimized Applications

Sharangad · Sep 19, 2009

djgamble said:
Apple's just been MEGA lazy with the drivers!!!
We're talking 3.5 years of computers... they've chosen not 2 support 2 VERY POPULAR ATI cards (which basically cover everybody who doesn't have the latest round of MBP's or Mac Pro's.)

It's just a grudge match against ATI... Apple ALWAYS does this...
OS X OpenGL - didn't support the older ATI cards (OS 9 OpenGL did though? Apple got sued big time for that one because they'd sold a bunch of iMacs to a HUGE law firm... claiming they were OS X capable as they had G3's)
Quartz extreme - didn't support the Rage Pro...
Quartz extreme (when updated) - didn't support the GeForce 2
OpenCL - doesn't support computers more than 2 years old (although it was drafted out and announced before then)

Unfortunately your 3870 doesn't support shared memory, which is required for OpenCL. All Nvidia DirectX 10 cards support this, but of the ATI cards only the HD 4000 series does. Your 3870 will never support OpenCL.

As for your older cards, the Rage Pro was a DX5-generation card, which was probably why it was dropped.
Quartz Extreme required DirectX 9 class (programmable) hardware; the GeForce 2 was a DX7 card and didn't have programmable shaders (cores) to run Quartz Extreme.

Now the 3870 doesn't have shared memory despite being a DX10 card (as DX10 didn't require it). However, all of Nvidia's DX10 cards support it, as they were designed for CUDA, and OpenCL is just a non-proprietary version of CUDA.

If anything you should blame ATI for building such a lame card.

holmesf · Sep 19, 2009

2002cbr600f4i said:
Given that it's been pouring here for the last 4 days, and they're calling for more rain all weekend (anyone got an ark they can loan me???) I might go ahead and take a crack at writing some sort of program to test the various configurations discussed earlier in the thread.

Some possible ideas I had:

1) a process that takes 2 very huge (100000x100000 each) arrays of numbers and multiplies them together
2) something that takes a huge list of random text names in mixed case and converts them into uppercase.
3) something that does some kind of sort of a large amount of data (maybe the outputs of 1 and 2?) This is assuming I can find and implement a sort algorithm that can be done both linearly and threaded.

I might even try to make all 3 be run to REALLY stress things.

Note, that I plan to pre-generate all the test data so the exact same "random" data can be run over and over, taking variations in the data out of the "benchmarking" equation, regardless of user running it and computer.

So, my plan would be:

* Implement each of these things in a linear processing fashion.
* Implement each of these things to be done using blocks + GCD
* Implement each of these to be done by OpenCL

(Once I get the 1st or 2nd one implemented, I'll toss up the source code someplace and ask somebody who's more familiar with OSX development to write the normal PThreads type implementation since I don't have a clue how to do that.)

I'll keep you all updated to my progress over the weekend. I haven't done much Objective C - (in fact it's been nearly a year since I've touched it) but I think I'll be able to figure out how to do at least some of this this weekend.

I'd be very interested in the results! A few notes though to do with OpenCL on your proposed tasks:

1. OpenCL (when running on the GPU) is bound by the bandwidth between GPU and CPU. This tends to be around 5GB per second -- but the GPU can do many times more floating-point operations per second than this (about 200 times more!). So when you choose a benchmark, you should choose one where the number of operations per piece of data that you transfer is very high. Preferably don't choose an O(n) type operation.

2. OpenCL (and GPUs!) aren't designed for text. Also making something uppercase is an O(n) operation (where n is the length of the string) so you're bandwidth bound (see point 1). This makes it not the best OpenCL benchmark.

3. It's very difficult (though possible) to write an efficient sorting algorithm on OpenCL. In fact it's so hard to do well that this is the sort of stuff people publish papers on in the GPGPU industry!
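
To put rough numbers on point 1, here is a back-of-the-envelope sketch in Python. It takes the figures from the post at face value (about 5 GB/s across the bus, and "about 200 times more" floating-point operations per second, read here as roughly 1e12 per second) and assumes 4-byte floats. None of these are measurements, and the function names are made up for illustration.

```python
# Back-of-the-envelope: how many operations per transferred float are needed
# before the GPU, rather than the CPU<->GPU bus, becomes the bottleneck?
# All figures are assumptions taken from the post, not measurements.

BUS_BYTES_PER_SEC = 5e9               # ~5 GB/s between CPU and GPU (assumed)
GPU_FLOPS = 200 * BUS_BYTES_PER_SEC   # "about 200 times more", read as ~1e12 FLOPS
BYTES_PER_FLOAT = 4                   # single-precision floats (assumed)


def break_even_ops_per_float():
    """Ops per transferred float at which compute time equals transfer time."""
    floats_per_sec = BUS_BYTES_PER_SEC / BYTES_PER_FLOAT
    return GPU_FLOPS / floats_per_sec


def matmul_ops_per_float(n):
    """Ops per transferred float for a naive n x n matrix multiply.

    Work: about 2*n**3 operations (one multiply and one add per term).
    Traffic: 3*n**2 floats (two inputs to the GPU, one result back).
    """
    return (2 * n**3) / (3 * n**2)


if __name__ == "__main__":
    print("break-even ops per float:", break_even_ops_per_float())
    for n in (100, 1200, 10000):
        print(n, "x", n, "matmul ops per float:", matmul_ops_per_float(n))
```

Under these assumptions the break-even is about 800 operations per float. A naive matrix multiply only reaches that once the matrices are around 1200 x 1200, while a single-pass O(n) job such as uppercasing stays at roughly one operation per item no matter how large the input gets.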

2002cbr600f4i · Sep 20, 2009

holmesf said:
I'd be very interested in the results! A few notes though to do with OpenCL on your proposed tasks:

1. OpenCL (when running on the GPU) is bound by the bandwidth between GPU and CPU. This tends to be around 5GB per second -- but the GPU can do many times more floating-point operations per second than this (about 200 times more!). So when you choose a benchmark, you should choose one where the number of operations per piece of data that you transfer is very high. Preferably don't choose an O(n) type operation.

2. OpenCL (and GPUs!) aren't designed for text. Also making something uppercase is an O(n) operation (where n is the length of the string) so you're bandwidth bound (see point 1). This makes it not the best OpenCL benchmark.

3. It's very difficult (though possible) to write an efficient sorting algorithm on OpenCL. In fact it's so hard to do well that this is the sort of stuff people publish papers on in the GPGPU industry!

Well, I have my little (java) data generator done and working. That's as far as I've gotten...

It's been nearly a year since I've touched Objective-C and it's amazing how much I've forgotten, so I don't know that I'll get anywhere significant on this today.

That being said, my plan was to not use ANY of the library calls for things like converting a string to uppercase, or doing the matrix multiplication. Everything was going to be done using loops, or breaking the data set up and assigning subsets of the data to different threads/cores.

Actually, uppercasing a word IS mathematical in that you have to look at the ASCII code of each character in the string (so yeah, that's O(n) per string, O(n^2) across the set), see if it's in the lowercase set, and if so, add in the offset value to get the uppercase one. The fact that it's an O(n) operation, but that no 2 items in the array are data related to each other SHOULD allow a threaded process the ability to do this quicker than O(n^2) as subsets of the full set can be processed by each thread.
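
The hand-rolled approach described here could look something like the sketch below. It is written in Python rather than Objective-C purely for brevity, it assumes plain ASCII input, and the names `upper_ascii`, `upper_chunk` and `upper_all` are invented for the example. It shows how the list can be cut into independent chunks; in CPython the interpreter lock means threads will not actually speed up this kind of pure-Python loop, so treat it as an illustration of the partitioning only.

```python
from concurrent.futures import ThreadPoolExecutor


def upper_ascii(word):
    """Uppercase one ASCII string by hand, without calling str.upper()."""
    out = []
    for ch in word:
        code = ord(ch)
        if 97 <= code <= 122:   # 'a'..'z'
            code -= 32          # offset between 'a' (97) and 'A' (65)
        out.append(chr(code))
    return "".join(out)


def upper_chunk(words):
    """Uppercase every word in one chunk; chunks share no data."""
    return [upper_ascii(w) for w in words]


def upper_all(words, workers=4):
    """Split the list into contiguous chunks and hand each chunk to a worker."""
    if not words:
        return []
    size = -(-len(words) // workers)   # ceiling division
    chunks = [words[i:i + size] for i in range(0, len(words), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(upper_chunk, chunks))   # map keeps chunk order
    return [w for chunk in results for w in chunk]
```

For example, `upper_all(["alice", "Bob"], workers=2)` gives back `["ALICE", "BOB"]`. Because no chunk depends on any other, the same split would carry over to a pthreads or GCD version.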

The matrix multiplication I'll need to check the math on, as I don't remember the rules for performing that. My hope was that I could find some operation where I could ship off subsets of the data (like a whole row) to each thread/core, but I'm not sure multiplication on arrays is independent like that.

As far as the sorting - strange, there are several good multithreaded sorting algorithms (variations on Quicksort for instance). I would think one of those would work in OpenCL code, but maybe I'm wrong.

The key to this test is to do things as similarly as possible between the various methods, so you're not using anything super optimized for any of them. We're just trying to demonstrate that linear code gets a speed up from normal Leopard threads, then how that same code behaves using GCD, and then how it behaves again using OpenCL.

Anyhow, my grad school classes start back up on Tuesday, so I don't know that I'm going to even get the single threaded version done before then. After Tuesday I'm slammed. If there's somebody reading this who is a much better Obj-C OSX programmer than me who wants to take a crack at this, let me know and I'll explain what I was looking into doing and how...

gattis · Sep 20, 2009

I tried to optimize some of the stuff I am developing using Snow Leopard and OpenCL. The limited bandwidth over the PCI Express bus actually makes simple matrix multiplications many times slower than a multicore CPU on most cards. Outside of graphics applications, I don't really see the potential for GPGPU until we see a unified memory architecture where GPUs can access the RAM as fast as the CPU. And graphics applications can already use the GPU using older libraries such as OpenGL which do common operations for you. I don't see where OpenCL adds any benefit just yet, but it's good that Apple is getting there early on the software front. We just have to wait for the hardware to catch up now.

Anyway, here's my writeup on my experimentation with Snow Leopard's OpenCL:

http://mattgattis.com/blog/2009/09/19/opencl-impressions/

holmesf · Sep 20, 2009

2002cbr600f4i said:
Well, I have my little (java) data generator done and working. That's as far as I've gotten...

It's been nearly a year since I've touched Objective-C and it's amazing how much I've forgotten, so I don't know that I'll get anywhere significant on this today.

That being said, my plan was to not use ANY of the library calls for things like converting a string to uppercase, or doing the matrix multiplication. Everything was going to be done using loops, or breaking the data set up and assigning subsets of the data to different threads/cores.

Actually, uppercasing a word IS mathematical in that you have to look at the ASCII code of each character in the string (so yeah, that's O(n) per string, O(n^2) across the set), see if it's in the lowercase set, and if so, add in the offset value to get the uppercase one. The fact that it's an O(n) operation, but that no 2 items in the array are data related to each other SHOULD allow a threaded process the ability to do this quicker than O(n^2) as subsets of the full set can be processed by each thread.

I agree that a threaded process can do it faster than an unthreaded process, but on the GPU you've got the problem that since you're doing so little work per character (just a possible add) it's not worth transferring the characters to the GPU. I also don't think the GPU would handle the conditional check that you have to do on each character very well (where you check if it's lower case before adding). The reason is that for good performance on the GPU you have to arrange the data so that related threads (called a warp) all take the same path through each conditional expression. Maybe there's a way around this, but otherwise each warp (usually a set of 32 threads) ends up running in serial instead of parallel ... ouch, a 32x performance decrease!
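
One possible way around the per-character branch is to fold the check into arithmetic, so every thread executes the same instructions whatever the character is. The sketch below shows the idea in Python for ASCII codes only; the function names are invented for the example. Whether a real OpenCL compiler needs this rewrite, or already turns such a simple `if` into a select instruction on its own, is not something this sketch verifies.

```python
def upper_branchy(code):
    """Uppercase one ASCII code using an explicit conditional."""
    if 97 <= code <= 122:   # 'a'..'z'
        return code - 32
    return code


def upper_branchless(code):
    """Same result with no if/else: the comparison becomes a 0 or 1 multiplier."""
    is_lower = int(97 <= code <= 122)   # 1 for 'a'..'z', otherwise 0
    return code - 32 * is_lower


def upper_text(text):
    """Apply the branch-free version to every character of an ASCII string."""
    return "".join(chr(upper_branchless(ord(ch))) for ch in text)
```

Both functions agree on every ASCII code, so the branch-free form can stand in for the conditional one wherever a uniform instruction stream matters more than skipping the subtraction.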

2002cbr600f4i said:
The matrix multiplication I'll need to check the math on, as I don't remember the rules for performing that. My hope was that I could find some operation where I could ship off subsets of the data (like a whole row) to each thread/core, but I'm not sure multiplication on arrays is independent like that.

Matrix multiplication is actually a very good candidate for GPGPU processing. The reason is that there is a high ratio of work done per floating point number transferred. Dividing up that work is still tricky though.
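
On the independence question: each row of the product depends only on the matching row of the first matrix and on the whole second matrix, so rows (or blocks of rows) can be handed out separately. Here is a minimal pure-Python sketch of that split, using lists of lists and invented function names. It runs the blocks one after another and makes no attempt at the tiling a real GPU kernel would need.

```python
def matmul_row(a_row, b):
    """One row of A*B: needs only this row of A plus all of B."""
    inner = len(b)
    cols = len(b[0])
    return [sum(a_row[k] * b[k][j] for k in range(inner)) for j in range(cols)]


def matmul(a, b):
    """Plain row-by-row matrix multiply."""
    return [matmul_row(row, b) for row in a]


def matmul_by_blocks(a, b, blocks=2):
    """Multiply in independent row blocks, then stitch the results together.

    Each block could be given to a different thread or core, since no block
    reads anything another block writes.
    """
    if not a:
        return []
    size = -(-len(a) // blocks)   # ceiling division
    result = []
    for start in range(0, len(a), size):
        result.extend(matmul(a[start:start + size], b))
    return result
```

Splitting by rows gives the same answer as the straight loop, which is what makes the "ship a whole row to each thread" plan workable on the CPU side.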

2002cbr600f4i said:
As far as the sorting - strange, there are several good multithreaded sorting algorithms (variations on Quicksort for instance). I would think one of those would work in OpenCL code, but maybe I'm wrong.

The key to this test is to do things as similarly as possible between the various methods, so you're not using anything super optimized for any of them. We're just trying to demonstrate that linear code gets a speed up from normal Leopard threads, then how that same code behaves using GCD, and then how it behaves again using OpenCL.

Ok.

Unfortunately with OpenCL sometimes it's super-optimized or there's just no point. GPUs are strange beasts -- every step of the algorithm you use, and every detail of it, needs to take their architecture into account. If you screw up your memory access pattern on a GeForce 8800, for example, your memory bandwidth can suddenly be cut by a factor of 16. Sorting is a problem where it's especially difficult to get this right.

Anyway, I said that people write papers on this sort of stuff, so here's one if you have time:
http://mgarland.org/files/papers/gpusort-ipdps09.pdf

2002cbr600f4i said:
Anyhow, my grad school classes start back up on Tuesday, so I don't know that I'm going to even get the single threaded version done before then. After Tuesday I'm slammed. If there's somebody reading this who is a much better Obj-C OSX programmer than me who wants to take a crack at this, let me know and I'll explain what I was looking into doing and how...

Cool, I'm starting my master's program in CS on Thursday!

bmb012 · Sep 20, 2009

Strange how all of this is still so up in the air, you'd think Apple would have some sort of proof of concept 'killer-app,' or at least some samples ready to show the benefits of all this.

What about physics calculations? Aren't those PPUs basically just GPUs that you can run physics code on? Any idea if PhysX will support OpenCL?

holmesf · Sep 20, 2009

bmb012 said:
Strange how all of this is still so up in the air, you'd think Apple would have some sort of proof of concept 'killer-app,' or at least some samples ready to show the benefits of all this.

What about physics calculations? Aren't those PPUs basically just GPUs that you can run physics code on? Any idea if PhysX will support OpenCL?

Since PhysX can run using CUDA on those GeForce cards that support CUDA, and since OpenCL is based on CUDA, my guess would be yes, PhysX could be written to use OpenCL.

As for killer apps ... Apple hasn't bundled anything with the OS, but if you look in the Snow Leopard Xcode examples you'll find some samples that are pretty good demonstrators. The n-body problem example even has a little speed gauge that lets you compare the performance of the app running on GPU, CPU, and GPU+CPU.

SPUY767 · Sep 21, 2009

DUSTmurph said:
Yeah developers, start taking advantage of this ****.

The sad part is, it borders on trivial to add GCD and OCL functionality to applications if they've been programmed with Xcode and follow Apple's API guidelines religiously. Granted, some houses like Adobe are just gluttons for punishment and would rather use an inferior IDE and API.

holmesf · Sep 21, 2009

SPUY767 said:
The sad part is, it borders on trivial to add GCD and OCL functionality to applications if they've been programmed with Xcode and follow Apple's API guidelines religiously. Granted, some houses like Adobe are just gluttons for punishment and would rather use an inferior IDE and API.

Adding OpenCL support is far from trivial.

GCD is easier, but if you've got a gigantic codebase like Adobe's, making the slightest change in the internals is probably a gargantuan undertaking!

2002cbr600f4i · Sep 21, 2009

holmesf said:
Adding OpenCL support is far from trivial.

GCD is easier, but if you've got a gigantic codebase like Adobe's, making the slightest change in the internals is probably a gargantuan undertaking!

Not to mention Adobe has to do the whole Carbon-->Cocoa conversion as well..

pmjoe · Sep 23, 2009

So, is the conclusion to this thread that Apple should develop a "physics application" to demonstrate the benefits of GCD and Open CL to the masses?

I'm sure that'll be a hot download.

I wonder what that says for those of us who said pages ago this stuff isn't that useful for most mainstream applications?

holmesf · Sep 23, 2009

pmjoe said:
So, is the conclusion to this thread that Apple should develop a "physics application" to demonstrate the benefits of GCD and Open CL to the masses?

I'm sure that'll be a hot download.

I wonder what that says for those of us who said pages ago this stuff isn't that useful for most mainstream applications?

You mean not everybody spends their day ripping Blu-ray movies, doing medical tomography, and running fluid simulations? I feel so lonely.

NT1440 · Sep 23, 2009

ThomasJL said:
And Flash still slows the computer down and makes the fans spin like crazy.

Why does Flash still suck?

Seriously, even a cheap Winblows netbook can run Flash better than a top-of-the-line maxed-out Mac Pro.

BLAME ADOBE

holmesf · Sep 23, 2009

NT1440 said:
BLAME ADOBE

Indeed ... Flash on Mac has always sucked.

I remember back in the year 2000 the shame of having my PowerMac G3 pale in comparison to an eMachines piece of junk at running Flash. And to add insult to injury, you had to use that damn hockey puck mouse.

medip · Sep 28, 2009

Help Me!!!!!

i am raging so hard. i bought the new MacBook Pro 15 inch for $3000, and i installed Parallels Tools on it, and it was working fine for the first few weeks, but now it freezes in windows and in mac os! i can't believe i spent that sort of money on this piece of trash, i need help. i do play games that use a lot of ram and stuff but not enough to completely crap my comp out. any tips would be helpful. i already tried changing the settings for my virtual machine but it made windows slower and crappier, and when i changed them back it was still the same, and i also reinstalled the mac drivers for windows. please someone help me.

diamond.g · Sep 28, 2009

medip said:
i am raging so hard. i bought the new MacBook Pro 15 inch for $3000, and i installed Parallels Tools on it, and it was working fine for the first few weeks, but now it freezes in windows and in mac os! i can't believe i spent that sort of money on this piece of trash, i need help. i do play games that use a lot of ram and stuff but not enough to completely crap my comp out. any tips would be helpful. i already tried changing the settings for my virtual machine but it made windows slower and crappier, and when i changed them back it was still the same, and i also reinstalled the mac drivers for windows. please someone help me.

Simple fix, uninstall Windows

CQd44 · Sep 28, 2009

medip said:
i am raging so hard. i bought the new MacBook Pro 15 inch for $3000, and i installed Parallels Tools on it, and it was working fine for the first few weeks, but now it freezes in windows and in mac os! i can't believe i spent that sort of money on this piece of trash, i need help. i do play games that use a lot of ram and stuff but not enough to completely crap my comp out. any tips would be helpful. i already tried changing the settings for my virtual machine but it made windows slower and crappier, and when i changed them back it was still the same, and i also reinstalled the mac drivers for windows. please someone help me.

I suppose you should try Boot Camp.

prostuff1 · Sep 28, 2009

medip said:
i am raging so hard. i bought the new MacBook Pro 15 inch for $3000, and i installed Parallels Tools on it, and it was working fine for the first few weeks, but now it freezes in windows and in mac os! i can't believe i spent that sort of money on this piece of trash, i need help. i do play games that use a lot of ram and stuff but not enough to completely crap my comp out. any tips would be helpful. i already tried changing the settings for my virtual machine but it made windows slower and crappier, and when i changed them back it was still the same, and i also reinstalled the mac drivers for windows. please someone help me.

You seriously expected to be able to run games at a decent pace from within Parallels?

If you want to do any gaming I suggest you set up your computer so that you can boot Windows. Install Boot Camp and set it up; that way you can play your games to your heart's content and get the performance needed.
