I'd be very interested in the results! A few notes though to do with OpenCL on your proposed tasks:
1. OpenCL (when running on the GPU) is bound by the bandwidth between GPU and CPU. This tends to be around 5GB per second, but the GPU can do many times more floating-point operations per second than that (about 200 times more!). So when you choose a benchmark, you should choose one where the number of operations per piece of data that you transfer is very high. Preferably don't choose an O(n) type operation.
2. OpenCL (and GPUs!) aren't designed for text. Also, making something uppercase is an O(n) operation (where n is the length of the string), so you're bandwidth bound (see point 1). This makes it not the best OpenCL benchmark.
3. It's very difficult (though possible) to write an efficient sorting algorithm on OpenCL. In fact it's so hard to do well that this is the sort of stuff people publish papers on in the GPGPU industry!
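To put rough numbers on point 1, here is a back-of-the-envelope sketch in C. The 5 GB/s bus figure is from above; the 1e12 operations per second used in the note below is an assumption, obtained by reading "about 200 times more" as 200 operations per byte transferred. Both function names are made up for illustration, and the model ignores latency and on-card memory bandwidth.

```c
#include <assert.h>

/* Floating-point operations the GPU could perform in the time it takes
   to move one 4-byte float across the CPU<->GPU bus. */
double break_even_ops_per_float(double bus_bytes_per_s, double gpu_flops)
{
    assert(bus_bytes_per_s > 0.0);
    double floats_per_s = bus_bytes_per_s / 4.0;  /* 4 bytes per float */
    return gpu_flops / floats_per_s;
}

/* Operations per float transferred for a naive n x n matrix multiply:
   2*n^3 operations (one multiply and one add per term) over
   3*n^2 floats moved (A and B in, C out). Simplifies to 2n/3. */
double matmul_ops_per_float(double n)
{
    assert(n > 0.0);
    return (2.0 * n * n * n) / (3.0 * n * n);
}
```

Under those assumed figures, `break_even_ops_per_float(5e9, 1e12)` comes out at 800 operations per float. Uppercasing does about one operation per byte, so it sits far below that line, while a naive matrix multiply only reaches it at around n = 1200.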
Well, I have my little (Java) data generator done and working. That's as far as I've gotten...
It's been nearly a year since I've touched Objective-C and it's amazing how much I've forgotten, so I don't know that I'll get anywhere significant on this today.
That being said, my plan was to not use ANY of the library calls for things like converting a string to uppercase, or doing the matrix multiplication. Everything was going to be done using loops, or breaking the data set up and assigning subsets of the data to different threads/cores.
Actually, uppercasing a word IS mathematical in that you have to look at the ASCII code of each character in the string (so yeah, that's O(n) per string, and O(m*n) across a set of m strings), see if it's in the lowercase range, and if it is, apply the offset value (subtract 32) to get the uppercase one. It's an O(n) operation, but no two items in the array depend on each other, so a threaded process SHOULD be able to get through the set quicker: the total work is the same, but each thread can process its own subset at the same time.
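A minimal sketch of that idea in plain C, with no library calls for the conversion itself. The function names are hypothetical, and the driver walks the chunks one after another purely as a stand-in: since the ranges never overlap, each `upper_range` call is the unit you would hand to a thread, a GCD block, or an OpenCL work-item.

```c
#include <assert.h>
#include <stddef.h>

/* Uppercase buf[begin, end) in place using only ASCII arithmetic. */
void upper_range(char *buf, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; i++) {
        if (buf[i] >= 'a' && buf[i] <= 'z') {
            buf[i] = (char)(buf[i] - ('a' - 'A'));  /* offset is 32 in ASCII */
        }
    }
}

/* Half-open range [*begin, *end) owned by worker `id` out of `workers`.
   The first (len % workers) workers get one extra element each. */
void chunk_bounds(size_t len, size_t workers, size_t id,
                  size_t *begin, size_t *end)
{
    assert(workers > 0 && id < workers);
    size_t base = len / workers;
    size_t extra = len % workers;
    *begin = id * base + (id < extra ? id : extra);
    *end = *begin + base + (id < extra ? 1 : 0);
}

/* Serial stand-in for the dispatch step: every iteration touches a
   disjoint range, so the iterations could run concurrently. */
void upper_chunked(char *buf, size_t len, size_t workers)
{
    for (size_t id = 0; id < workers; id++) {
        size_t begin, end;
        chunk_bounds(len, workers, id, &begin, &end);
        upper_range(buf, begin, end);
    }
}
```

Keeping the per-chunk function separate from the dispatch loop means the single-threaded, threaded, and GCD versions can all share the same inner code, which fits the goal of comparing like with like.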
The matrix multiplication I'll need to check the math on, as I don't remember the rules for performing it. My hope was that I could find some operation where I could ship off subsets of the data (like a whole row) to each thread/core, but I'm not sure multiplication on arrays is independent like that.
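On the independence question: in the standard definition, each entry of C = A x B is the dot product of one row of A with one column of B, so computing a row of C only reads that row of A plus all of B, and writes nowhere else. A rough sketch of a row-range worker, assuming row-major float arrays (the function name and parameter layout are illustrative, not taken from any library):

```c
#include <assert.h>
#include <stddef.h>

/* Compute rows [row_begin, row_end) of C = A * B.
   A has m columns, B is m x p, C has p columns; all are row-major.
   Different row ranges write to disjoint parts of C, so separate
   calls do not interfere with each other. */
void matmul_rows(const float *a, const float *b, float *c,
                 size_t m, size_t p, size_t row_begin, size_t row_end)
{
    assert(row_begin <= row_end);
    for (size_t i = row_begin; i < row_end; i++) {
        for (size_t j = 0; j < p; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < m; k++) {
                sum += a[i * m + k] * b[k * p + j];
            }
            c[i * p + j] = sum;
        }
    }
}
```

Calling it once with the full row range gives the single-threaded baseline; calling it once per row range from separate threads or cores should give the same result, because A and B are only read and each call owns its rows of C.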
As far as the sorting goes - strange, there are several good multithreaded sorting algorithms (variations on Quicksort, for instance). I would think one of those would work in OpenCL code, but maybe I'm wrong.
The key to this test is to do things as similarly as possible between the various methods, so you're not using anything super optimized for any of them. We're just trying to demonstrate that single-threaded code gets a speedup from normal Leopard threads, then how that same code behaves using GCD, and then how it behaves again using OpenCL.
Anyhow, my grad school classes start back up on Tuesday, so I don't know that I'm going to even get the single-threaded version done before then. After Tuesday I'm slammed. If there's somebody reading this who is a much better Obj-C OS X programmer than me who wants to take a crack at this, let me know and I'll explain what I was looking into doing and how...