Just a warning: 4GB of RAM might not be enough for Mavericks and routine tasks, which makes me wonder about new Macs being sold with... 4GB of RAM. After using it for the first time yesterday (after the upgrade I described a few posts back), I hit a hard-core spinning ball last night, and all I had been doing was surfing with Firefox. Yes, I had a bunch of tabs open (about 15), but I don't consider what I was doing extreme multitasking. Maybe it is?

It looks like a RAM upgrade will be a requirement for me not to lose my mind. Mavericks might handle memory better, but it seems to have higher overhead, a situation I specifically wanted to avoid. When it hits the wall, it hits the wall hard. Everything stopped; even my Wi-Fi connection dropped. It took about 10 minutes to recover enough that I could restart it. Fortunately I have a PC that volunteered to take up the slack. :p
 
Thanks for this great info. I hope the launch of the new MacPro (and the fact that this machine is "locked" with dual GPUs) will bring this matter to the table in the near future. As far as I can tell from watching some benchmarks, it seems that the Windows implementation of CrossFire literally doubles the performance (at least in most cases). Do you happen to know if this is just a "rendering ping-pong" method between the 2 GPUs or something more sophisticated?

Well, there are a number of methods to do multi-GPU rendering. I am not really sure which is the most commonly used one nowadays. Instead of describing how they work, I will just point you to Wikipedia, which gives you a good general understanding of the matter: http://en.wikipedia.org/wiki/Scalable_Link_Interface#SLI_Modes. But yes, ping-ponging seems to be one of the most popular ones. It is quite a big problem to get the timing right, though.

----------

Just a warning: 4GB of RAM might not be enough for Mavericks and routine tasks, which makes me wonder about new Macs being sold with... 4GB of RAM. After using it for the first time yesterday (after the upgrade I described a few posts back), I hit a hard-core spinning ball last night, and all I had been doing was surfing with Firefox. Yes, I had a bunch of tabs open (about 15), but I don't consider what I was doing extreme multitasking. Maybe it is?

It looks like a RAM upgrade will be a requirement for me not to lose my mind. Mavericks might handle memory better, but it seems to have higher overhead, a situation I specifically wanted to avoid. When it hits the wall, it hits the wall hard. Everything stopped; even my Wi-Fi connection dropped. It took about 10 minutes to recover enough that I could restart it. Fortunately I have a PC that volunteered to take up the slack. :p

I don't think it is the 4GB which is biting you here. What you are describing is a really serious stall; I have never experienced something like this with our machines (we have plenty of Minis running Mavericks which only have 4GB or less RAM).
 
Well, there are a number of methods to do multi-GPU rendering. I am not really sure which is the most commonly used one nowadays. Instead of describing how they work, I will just point you to Wikipedia, which gives you a good general understanding of the matter: http://en.wikipedia.org/wiki/Scalable_Link_Interface#SLI_Modes. But yes, ping-ponging seems to be one of the most popular ones. It is quite a big problem to get the timing right, though.

On windows the "ping pong" is mostly down with the drivers and the game does not have to deal with the syncing or assets between cards. On theMac CrossFire or the NV equivalent does not exist so you have to do all this yourself.

Game engines are not designed to easily swap contexts between cards, as the next frame often depends on data from the previous one, more so than, let's say, using them to speed up rendering. It's not impossible and there is some potential, but it is not likely going to be simple to implement, and when implemented it will likely be a custom solution per game.

But this is a new technology on the Mac, so over time things will likely get easier to use and implement. I'd be able to answer this better in a few months once I can have a look at how it all works. Being a very rare setup, it is lower on my to-do list than a few other things :)

Edwin
 
On windows the "ping pong" is mostly down with the drivers and the game does not have to deal with the syncing or assets between cards. On theMac CrossFire or the NV equivalent does not exist so you have to do all this yourself.

Game engines are not designed to easily swap contexts between cards, as the next frame often depends on data from the previous one, more so than, let's say, using them to speed up rendering. It's not impossible and there is some potential, but it is not likely going to be simple to implement, and when implemented it will likely be a custom solution per game.

I don't really see why this is true. On OS X, you are already supposed to use multiple shared contexts in parallel to do asynchronous asset loading. Syncing is quite trivial to do with the sync objects. Same goes for your comment about reusing data from the previous frame - as you would share all the objects between your contexts anyway, this is also quite trivial.
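To illustrate what I mean by sync objects being simple, here is roughly how the handoff between two shared contexts looks. This is only a bare sketch with made-up names and no error handling; the loader runs in one thread with its own shared context, the renderer in another.

Code:
#include <OpenGL/gl3.h>

// Written by the loader thread, read by the renderer. A real engine would
// guard this handoff with an atomic or a mutex.
GLsync uploadDone = 0;

// Loader thread (its own shared context is current): upload and fence.
void upload_assets(GLuint vbo, const void* data, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, size, data, GL_STATIC_DRAW);
    uploadDone = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush(); // make sure the fence actually reaches the GPU
}

// Render thread (main context is current): make the GPU wait, not the CPU.
void draw_if_uploaded(GLuint vao, GLsizei vertexCount)
{
    if (!uploadDone)
        return;
    glWaitSync(uploadDone, 0, GL_TIMEOUT_IGNORED);
    glBindVertexArray(vao);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}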

The challenge is properly managing your threads (you'd want two rendering threads and a control one, which does animation) and maybe doing output - as only one GPU can output the frame, you need to render to an FBO on the inactive one and then copy it to the main framebuffer on the active GPU. But it's certainly doable. And it is probably better than what the driver can do - after all, you know best how you do your rendering and can adjust your syncing appropriately. Of course, this whole story rises and falls with how well Apple's OpenGL supports shared contexts (especially ones which use different GPUs). I have yet to do some testing on that.
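For the output part, what I have in mind is something like the following. Again, just a sketch with invented names: 'sceneTex' is a texture shared by both contexts, and each context wraps it in its own FBO, since FBOs themselves are not shared.

Code:
#include <OpenGL/gl3.h>

// On the inactive GPU: render the frame into the shared texture.
void render_offscreen(GLuint offlineFbo /* attaches sceneTex */, int w, int h)
{
    glBindFramebuffer(GL_FRAMEBUFFER, offlineFbo);
    glViewport(0, 0, w, h);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... draw the scene ...
    glFlush(); // push the work so the display GPU is not left waiting
}

// On the active GPU: copy the shared texture into the window's framebuffer.
void present(GLuint displayFbo /* also attaches sceneTex */, int w, int h)
{
    glBindFramebuffer(GL_READ_FRAMEBUFFER, displayFbo);
    glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
    glBlitFramebuffer(0, 0, w, h, 0, 0, w, h, GL_COLOR_BUFFER_BIT, GL_NEAREST);
}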

But this is a new technology on the Mac, so over time things will likely get easier to use and implement. I'd be able to answer this better in a few months once I can have a look at how it all works. Being a very rare setup, it is lower on my to-do list than a few other things :)

It's not at all new. Dual-GPU Macs have existed for years. To try this out you just need a MacBook Pro with iGPU+dGPU ;) Of course, it makes less sense there because of the performance asymmetry.
 
It's not at all new. Dual-GPU Macs have existed for years. To try this out you just need a MacBook Pro with iGPU+dGPU ;) Of course, it makes less sense there because of the performance asymmetry.

That's not even remotely the same thing.
 
That's not even remotely the same thing.

From a programmer's standpoint, it is exactly the same thing. The code is the same (except for performance tweaks). The only difference is the performance characteristics. You can certainly use the iGPU to prototype a multi-GPU rendering engine.
 
Syncing is quite trivial to do with the sync objects. Same goes for your comment about reusing data from the previous frame - as you would share all the objects between your contexts anyway, this is also quite trivial.
I'd shy away from telling people things are "trivial". I don't think you meant it that way, but it sounded very dismissive. I don't code on Feral games very much, being more production-based in recent years, but I do spend a lot of time working with driver teams at AMD, plus AppleGL etc. After looking at this problem a few times (the first time I discussed it as a possible option was with the G4 Macs years ago, so I have thought about it a few times now), it is possible but hardly trivial. Your example is a very good one for a simple game engine that you could use to test multiple GPUs. When you get a complex modern game with physics, multiple-pass rendering, and dynamic shader generation with changing variables, the syncing problem suddenly gets more complex.

The best generic option I had thought of that avoids these issues would be to look at rendering half the frame on each card and DMAing one half to the other card before the final render to screen. However, the best method would vary from game to game depending on the game engine, and I would need a MacPro and some spare time to test various methods and ideas before stating what is the best way of doing it, as reality and theory are often miles apart. :)
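To make the half-frame idea a bit more concrete, the rough shape of it would be something like this. Completely untested, and the names are invented: each context renders its half into a shared texture with a scissor rect, and the card driving the display composites the two halves at the end.

Code:
#include <OpenGL/gl3.h>

// Render only one half of the frame on this context's GPU. The same camera
// and draw calls are issued on both cards; only the scissor rect differs.
void render_half(GLuint fboForThisContext, int w, int h, bool topHalf)
{
    glBindFramebuffer(GL_FRAMEBUFFER, fboForThisContext);
    glViewport(0, 0, w, h);
    glEnable(GL_SCISSOR_TEST);
    glScissor(0, topHalf ? h / 2 : 0, w, h / 2);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... draw the scene ...
    glDisable(GL_SCISSOR_TEST);
    glFlush();
}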

as only one GPU can output the frame, you need to render to an FBO on the inactive one and then copy it to the main framebuffer on the active GPU.
Actually, that has some unwanted overhead based on a few tests; to get the best performance you can/should DMA the memory between the cards, but without a MacPro I can't be sure the DMA method works as expected. However, this is likely how Apple gets their performance in FCP.

But it's certainly doable. And it is probably better than what the driver can do - after all, you know best how you do your rendering and can adjust your syncing appropriately.
I never said it was not doable; in fact, I said the opposite :)

I did mention that on the PC the game drivers do the work for you; you don't have to write it yourself inside the game engine(s). They do this for a reason, as doing it in-game by the developers is more complex and more work. Having it in the drivers makes it a lot easier for everyone, as the game does not need to do all the shuffling - the drivers can do it. It might not be quite as fast as a hand-written and optimised solution per game, but it will work on all games.

this whole story rises and falls with how well Apple's OpenGL supports shared contexts
Partly, but you also have to take into account the game engines: the problem is not writing a game engine from scratch that supports multiple GPUs, but taking a very complex game engine that does not support multiple GPUs and modifying it to work with them. That is harder, as many engineering decisions have already been made before you start, and plenty of them are serial in nature.

Some things, like rendering on one card and displaying on another, can be done, and I have seen Feral games do this when testing various library and OS X features like windowed mode. If it were all down to a few simple render commands and DMAing the final frame over, then it would be fairly trivial, but that is the easy bit. :)

I said in my post that I am sure you will be able to get gains, but as the potential benefits for your average Mac user are not that high (the main benefit is on a rare, super-high-performance Mac), there are much more productive speed boosts to be found in other areas. The idea of using the iGPU is one I looked at back when the HD3000 was bundled in MBP machines, but the overhead and weak iGPU performance looked to outweigh the benefits of using two GPUs. With the new HD5000 cards and the improved Intel drivers in Mavericks, it might be worth another look once other performance boosts that affect all users are completed.

From a programmer's standpoint, it is exactly the same thing. The code is the same (except for performance tweaks). The only difference is the performance characteristics. You can certainly use the iGPU to prototype a multi-GPU rendering engine.

From a programmer's standpoint, optimising for a PlayStation and a Mac are exactly the same thing in theory. ;) However, in reality it's different. Same thing with your example: in theory it's the same and uses the same code; in reality, getting it to give you a usable performance advantage with correct rendering is another matter.

You can certainly use the iGPU to prototype a multi-GPU rendering engine.

Yep, but converting a non-multi-GPU rendering engine into a multi-GPU rendering engine without breaking or rewriting the game is a different and more complex problem to solve.

I know this post has mostly been highlighting a few issues and problems with your suggestions, but I am not dismissing your views or the very useful commentary for others reading this thread. I am glad someone has done some reading up and raised a few questions. My aim is to highlight that although designing a game engine to support multiple GPUs is not a huge problem (it's still more complex than your "trivial" suggests), converting a game engine that is not designed to support multiple GPUs into one that can, without any of the driver tools available on other platforms that assist with this, is a much more complex and harder-to-solve problem than making a game engine from scratch.

What's more, as the benefits would only boost users with already super-powerful machines, it is a lower priority for developer time. The time spent on making a MacPro (or even a MacBookPro) faster will not benefit users with lower-performance Macs (or single GPUs). Making the fast machines faster at the expense of the slower ones is not a good decision in terms of allocating programmer time. This is why multiple GPUs have historically been on the "if we have time" list, not the "important to fix before release" list. We constantly evaluate the situation, but that's where I am coming from, from a games developer's/publisher's standpoint.

Edwin
 
I'd shy away from telling people things are "trivial". I don't think you meant it that way, but it sounded very dismissive. I don't code on Feral games very much, being more production-based in recent years, but I do spend a lot of time working with driver teams at AMD, plus AppleGL etc. After looking at this problem a few times (the first time I discussed it as a possible option was with the G4 Macs years ago, so I have thought about it a few times now), it is possible but hardly trivial. Your example is a very good one for a simple game engine that you could use to test multiple GPUs. When you get a complex modern game with physics, multiple-pass rendering, and dynamic shader generation with changing variables, the syncing problem suddenly gets more complex.

I agree that my usage of 'trivial' might have been a bit premature. Basically, what I meant is a quite naive implementation which should, in theory, work independently of the complexity of the engine (the question of whether it would work in practice is of course quite a different one).

A rendering loop involves sending commands to the GPU and waiting for the final frame. The GPU is unable to start processing the next frame until it has finished the current one. However, you can start feeding it commands for the next frame before that - the driver will simply queue them until the GPU is free.
Now, imagine that you have two rendering contexts on different GPUs which share all shaders, textures, VBOs etc. After you submit the commands for one frame, you do your context.flush or swapbuffers or whatever, switch the context to the other GPU and do the rendering just as usual. If the driver does its thing correctly, there should be no sync issues. After all, changing a shader variable does not affect previous draw calls - even if these calls have not completed yet.
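In (untested) code, the loop I have in mind is roughly this, assuming ctxA and ctxB are shared CGL contexts created on the two renderers and the driver queues commands per context as described. The names are invented and presenting is left out (that is the FBO copy discussed earlier).

Code:
#include <OpenGL/OpenGL.h>
#include <OpenGL/gl3.h>

void render_loop(CGLContextObj ctxA, CGLContextObj ctxB, bool* running)
{
    CGLContextObj ctx[2] = { ctxA, ctxB };
    for (unsigned frame = 0; *running; ++frame) {
        CGLSetCurrentContext(ctx[frame & 1]); // ping-pong between the GPUs
        // update per-frame uniforms/buffers (shared objects), then:
        // ... issue the draw calls for this frame ...
        glFlush(); // hand the frame off without blocking the CPU
        // presenting is separate: the finished frame still has to be copied
        // to the GPU that owns the display before the swap happens there
    }
}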

Again, I must stress that this is all based on ideal drivers, hardware, etc. We all know that drivers often do something they are not really supposed to. Also, some sharing scenarios are difficult or impossible for the driver to optimise. As I said before, if I had to develop a game engine, I wouldn't even invest any time in making it play with multiple GPUs - it's just not worth the time currently. The most I would do is utilise the second GPU for some additional processing where precise timing synchronisation is not necessary, like computing global illumination or things like that (where you can get away with using the average of the last few frames).

I do think that it would be interesting to try and build something like the above into an existing game engine and see what happens. I am in academia after all; curiosity is what it's all about ;) So please don't read too much into what I am writing here. It's all quite speculative and by no means definitive. I am very well aware that you have much more experience with actual game engines and better insight into driver limitations than me.

Actually, that has some unwanted overhead based on a few tests; to get the best performance you can/should DMA the memory between the cards, but without a MacPro I can't be sure the DMA method works as expected. However, this is likely how Apple gets their performance in FCP.

Hm, blitting or RTT via an FBO should be implemented as a DMA transfer by the driver anyway. Just out of curiosity, how would you instruct the driver to perform an explicit DMA transfer? I am not aware of any OpenGL functionality along these lines. I know that OS X offers the IOSurface API, but it's so poorly documented and I never had the time to look at it properly.


Partly, but you also have to take into account the game engines: the problem is not writing a game engine from scratch that supports multiple GPUs, but taking a very complex game engine that does not support multiple GPUs and modifying it to work with them. That is harder, as many engineering decisions have already been made before you start, and plenty of them are serial in nature.

[...]

The time spent on making a MacPro (or even a MacBookPro) faster will not benefit users with lower-performance Macs (or single GPUs). Making the fast machines faster at the expense of the slower ones is not a good decision in terms of allocating programmer time. This is why multiple GPUs have historically been on the "if we have time" list, not the "important to fix before release" list. We constantly evaluate the situation, but that's where I am coming from, from a games developer's/publisher's standpoint.

This makes perfect sense.
 
The good thing about CrossFire being automatic on the Windows side is that you don't need to wait for game creators to decide to support it. Being so much of a hassle on the OS X side (since it needs so much manual implementation to work) makes me very sceptical about whether this is going to be implemented in the near future.

I can't help but think, though, that it's just a waste having 2 GPUs while one of them sits idle doing nothing other than look pretty, since the nMP under Boot Camp has out-of-the-box CrossFire functionality which greatly boosts gaming performance - at least according to the few benchmarks available so far.
 
From a programmer's standpoint, it is exactly the same thing. The code is the same (except for performance tweaks). The only difference is the performance characteristics. You can certainly use the iGPU to prototype a multi-GPU rendering engine.

OK, I'm not going to be accusatory, as I am genuinely curious. How would such a thing be possible on systems where you have an Intel iGPU and an NVIDIA/AMD dGPU? Further, wouldn't the nature of how Optimus and Enduro work (not applicable to the Mac, obviously) prevent this, since all the data from the dGPU has to be routed through the iGPU while gaming?

AMD has/had asymmetric Crossfire, and I can certainly see how that works because it's the same company making the CPU and GPU. I had an Asus notebook with this setup, and it was a spectacular disaster. However, I'm not following where you're going if you're saying this would be possible with Intel + NVIDIA/AMD.
 
OK, I'm not going to be accusatory, as I am genuinely curious. How would such a thing be possible on systems where you have an Intel iGPU and an NVIDIA/AMD dGPU? Further, wouldn't the nature of how Optimus and Enduro work (not applicable to the Mac, obviously) prevent this, since all the data from the dGPU has to be routed through the iGPU while gaming?

AMD has/had asymmetric Crossfire, and I can certainly see how that works because it's the same company making the CPU and GPU. I had an Asus notebook with this setup, and it was a spectacular disaster. However, I'm not following where you're going if you're saying this would be possible with Intel + NVIDIA/AMD.

It's just like bridgeless SLI/CrossFire - you do stuff on different GPUs and the system handles copying it back and forth. From a programmer's standpoint, it works quite seamlessly. Again, I can't comment on the performance as I never did any tests. Obviously, you have to deal with problems like potentially different image quality, GPU capabilities and stuff like that. Here is a screenshot of a simple app (it took literally 10 minutes of trial and error to modify an Apple-provided demo) I made to illustrate that it works - the active (connected to the output) GPU in this case is the IGP:

[screenshot: Dlk3Zcu.jpg]


In practice, of course, it's less useful as the performance of the cards differs. I can imagine doing the main rendering on the dGPU and offloading some secondary work to the IGP (as I mentioned in my previous posts).

The bottom line here is: you can take a MBP and prototype some multi-GPU code, no problem. The tweaking etc. is a different thing, as Edwin very correctly points out. And of course, we live in a non-ideal world, and this is especially true with graphics drivers. Sometimes a feature does not work or just does not do what it is supposed to do. I hope that this will get much better in a few years, when GPU programmability reaches the level where we don't need a complex API anymore. Just give me something like GL_NV_vertex_buffer_unified_memory and friends and I will be very happy :)
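For reference, the "ten minute" part is basically just the context setup. From memory it looks roughly like this (treat it as a sketch, error handling omitted; the renderer IDs come from CGLQueryRendererInfo/CGLDescribeRenderer with kCGLRPRendererID):

Code:
#include <OpenGL/OpenGL.h>
#include <OpenGL/gl3.h>

// Create a context pinned to a particular renderer (GPU), sharing objects
// with the main context so textures/buffers/shaders are visible to both.
CGLContextObj make_context_on_renderer(GLint rendererID, CGLContextObj shareWith)
{
    CGLPixelFormatAttribute attrs[] = {
        kCGLPFAAccelerated,
        kCGLPFAAllowOfflineRenderers, // allow the GPU not driving the display
        kCGLPFARendererID, (CGLPixelFormatAttribute)rendererID,
        kCGLPFAOpenGLProfile, (CGLPixelFormatAttribute)kCGLOGLPVersion_3_2_Core,
        (CGLPixelFormatAttribute)0
    };
    CGLPixelFormatObj pix = NULL;
    GLint npix = 0;
    CGLChoosePixelFormat(attrs, &pix, &npix);

    CGLContextObj ctx = NULL;
    CGLCreateContext(pix, shareWith, &ctx);
    CGLReleasePixelFormat(pix);
    return ctx;
}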
 
In practice, of course, it's less useful as the performance of the cards differs. I can imagine doing the main rendering on the dGPU and offloading some secondary work to the IGP (as I mentioned in my previous posts).

Isn't this something that the actual game engine has to support? Is such a capability something that can simply be coded in without breaking other stuff?

In the end, of course, the aforementioned performance differences make this impractical, but it's still an interesting concept.
 
Isn't this something that the actual game engine has to support? Is such a capability something that can simply be coded in without breaking other stuff?

As pointed out by Edwin, it is quite easy to do in a new game engine, but might be quite tricky with an existing engine. Basically, what I have in mind are enhancement effects, especially those which do not necessarily need to be recomputed every frame (certain shadows, illumination effects, etc.). These can be scheduled for computation on a different OpenGL context (even if we are rendering on the same GPU), which makes it quite easy to 'plug' them into a different GPU if we want.
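A rough illustration of the kind of scheduling I mean, nothing more than a sketch with invented names: the lightmap texture is shared between the contexts, and the main renderer keeps sampling whichever version last completed.

Code:
#include <OpenGL/gl3.h>

// Set by the helper context/thread, polled by the main renderer. A real
// engine would protect this handoff properly.
GLsync lightmapReady = 0;

// Helper context (possibly on the other GPU): refresh the shared lightmap
// every few frames, at its own pace.
void refresh_lightmap(GLuint lightmapFbo)
{
    glBindFramebuffer(GL_FRAMEBUFFER, lightmapFbo);
    // ... recompute the illumination into the shared lightmap texture ...
    lightmapReady = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
    glFlush();
}

// Main context: poll without stalling; keep using the old lightmap until
// the new one has actually finished.
bool new_lightmap_available()
{
    if (!lightmapReady)
        return false;
    GLenum res = glClientWaitSync(lightmapReady, 0, 0);
    return res == GL_ALREADY_SIGNALED || res == GL_CONDITION_SATISFIED;
}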
 
I don't think it is the 4GB which is biting you here. What you are describing is a really serious stall; I have never experienced something like this with our machines (we have plenty of Minis running Mavericks which only have 4GB or less RAM).

Fortunately this has not happened since. Fingers crossed it will continue that way.
 