This is essentially an acknowledgement by nVidia that there's no room for growth in their core business, so they're hoping their name will give them an opportunity to sidestep into a new one. The problem is they're going it alone. As long as CUDA is a proprietary API, nobody is going to adopt it. Neither Dell nor Apple wants to tie themselves and their high-end applications to another single source.
If a standard emerged, I could see this going somewhere in the short term-- until Intel woke up and built their own Cell. In the GPU space, the video cards are largely abstracted through either OpenGL or DirectX. If nVidia or ATI disappeared, as so many have before them, the OS and application vendors wouldn't blink. They'd keep writing to the same interface and new hardware would pick up the load.
Here, Apple (or whoever) would need to mark in their system requirements: "Requires an nVidia CUDA processor". So now you need a specific Intel CPU and a specific nVidia GPU. They won't do it.
Frankly, it's silly to have all that silicon sitting out on a separate card doing nothing unless you're running the absolute latest games with updated drivers. nVidia is realizing that. AMD figured it out a while ago and bought ATI when they had the chance.
The only issue I have is that I think the current GF8 series of cards only do FP32 - 32-bit floating point, aka single precision. That may not be enough for most apps. It would be great for test runs, but I know for scientific calcs single precision isn't sufficient; they need FP64, i.e. double precision.
Are you familiar with char, short, long, long long, float, double, long double? All of those types exist on all processors-- but may be decomposed into smaller sub-operations if the processor doesn't support the width natively. The same would go here, I presume. From nVidia's standpoint, there's no need for anything wider than 32-bit calculations. They could put 64-bit data paths in, but then all those extra bits would be wasting silicon and power for any calculation with less precision. Make the little pieces run fast and the people who need more precision will still see the benefit.
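To make the decomposition idea concrete, here's a minimal sketch in plain C of the classic "double-single" trick: carry a value as a hi/lo pair of floats and do the carries by hand, so you recover most of the extra precision out of nothing but single-precision adds. The dsfloat type and the two_sum/ds_add names are mine, not anything from CUDA, and the error-free trick assumes straight IEEE single-precision arithmetic (SSE math, not x87 extended precision).

```c
#include <stdio.h>

/* A value is carried as an unevaluated sum hi + lo of two floats. */
typedef struct { float hi, lo; } dsfloat;

/* Knuth's two-sum: puts the rounded sum in *s and the rounding error in *e,
   so that a + b == *s + *e exactly (given plain IEEE float arithmetic). */
static void two_sum(float a, float b, float *s, float *e) {
    float sum = a + b;
    float bb  = sum - a;
    *e = (a - (sum - bb)) + (b - bb);
    *s = sum;
}

/* Double-single addition: roughly twice the precision of a single float,
   built only out of single-precision adds and subtracts. */
static dsfloat ds_add(dsfloat x, dsfloat y) {
    float s, e;
    two_sum(x.hi, y.hi, &s, &e);
    e += x.lo + y.lo;

    dsfloat r;
    two_sum(s, e, &r.hi, &r.lo);   /* renormalize so |lo| stays small vs |hi| */
    return r;
}

int main(void) {
    dsfloat one  = { 1.0f, 0.0f };
    dsfloat tiny = { 1e-8f, 0.0f };

    float plain = 1.0f + 1e-8f;        /* the 1e-8 is lost in single precision */
    dsfloat ds  = ds_add(one, tiny);   /* the 1e-8 survives in the lo word */

    printf("plain float  : %.10f\n", (double)plain);
    printf("double-single: %.10f\n", (double)ds.hi + (double)ds.lo);
    return 0;
}
```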
This is nothing like AltiVec.
This is exactly like AltiVec. It's bigger and it's off chip, but it's the same idea: fast, dedicated, vector processing.
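For anyone who never touched AltiVec, "vector processing" just means one operation applied to a whole register of lanes in lock-step. Here's a small sketch using the GCC/Clang vector extension (the v4f type and the numbers are only for illustration): an AltiVec register is four float lanes wide, and a GPU runs the same data-parallel pattern across hundreds of lanes.

```c
#include <stdio.h>

/* GCC/Clang vector extension: a 4-lane float vector, roughly what one
   AltiVec register holds.  The expressions below compile down to SIMD
   instructions that touch all four lanes at once. */
typedef float v4f __attribute__((vector_size(16)));

int main(void) {
    v4f x = { 1.0f, 2.0f, 3.0f, 4.0f };
    v4f y = { 10.0f, 20.0f, 30.0f, 40.0f };
    float a = 0.5f;
    v4f av = { a, a, a, a };           /* broadcast the scalar to every lane */

    /* y = a*x + y across all four lanes in lock-step. */
    v4f r = av * x + y;

    for (int i = 0; i < 4; i++)
        printf("lane %d: %g\n", i, (double)r[i]);
    return 0;
}
```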
It is probably closer to Cell than to a coprocessor. Think of many coprocessors on one die running in parallel, real fast.
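If it helps to picture it, here's a hypothetical little sketch in C using plain pthreads: one big array split across a handful of workers that all run the same per-element kernel on their own slice. Swap the threads for Cell SPEs or GPU multiprocessors and the shape of the program stays the same; the worker count, array size, and toy kernel are all made up for illustration. Build with -pthread.

```c
#include <pthread.h>
#include <stdio.h>

#define N_WORKERS 4
#define N_ELEMS   (1 << 20)

static float  data[N_ELEMS];
static double partial[N_WORKERS];

struct slice { int id, begin, end; };

/* Every worker runs the same kernel on its own slice of the array. */
static void *worker(void *arg) {
    struct slice *s = arg;
    double sum = 0.0;
    for (int i = s->begin; i < s->end; i++) {
        data[i] = data[i] * 2.0f + 1.0f;   /* the per-element "kernel" */
        sum += data[i];
    }
    partial[s->id] = sum;                  /* each worker owns its own slot */
    return NULL;
}

int main(void) {
    pthread_t    tid[N_WORKERS];
    struct slice sl[N_WORKERS];

    for (int i = 0; i < N_ELEMS; i++)
        data[i] = (float)(i % 100);

    int chunk = N_ELEMS / N_WORKERS;
    for (int w = 0; w < N_WORKERS; w++) {
        sl[w].id    = w;
        sl[w].begin = w * chunk;
        sl[w].end   = (w == N_WORKERS - 1) ? N_ELEMS : (w + 1) * chunk;
        pthread_create(&tid[w], NULL, worker, &sl[w]);
    }

    double total = 0.0;
    for (int w = 0; w < N_WORKERS; w++) {
        pthread_join(tid[w], NULL);
        total += partial[w];
    }
    printf("checksum: %.0f\n", total);
    return 0;
}
```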
Careful, I've brought this up in other threads, and people get their panties in a bunch about it... Because Apple went Intel rather than Cell, they think Cell is a dead concept. Cell is exactly where we're all heading. Because of inertia, we may see this wasteful division of labor for a while before the cost of doing it this way becomes too prohibitive, but eventually it'll all get brought onto the motherboard, then into the chipset, then into the main processor.
Mac Pros have 8 cores on them now-- and each core is wasting a huge amount of logic. Each core is handling a single-purpose thread, but carrying all the logic necessary for any type of thread that might be thrown at it. I've got a whole core handling a stream of integers coming off the network, but all the floating point and SSE logic is sitting there idle. Meanwhile I'm compressing a video stream and only have one SSE unit available to that thread.
And through all this my high powered GPU is being taxed with nothing more than a progress bar...