I'm not aware of any "secret sauce" functions in CUDA that would cause that to happen, especially with Metal 2 or OpenCL on Windows. I don't think there is any custom acceleration hardware that only CUDA has access to.

You might run into problems with Metal 1 or Apple's old OpenCL where you just couldn't do things and had to go a long way around, but Metal 2 has really beefed up the ability to write custom functionality at a low level, even if there isn't a 1:1 function map.

Short version: There isn't really any reason Metal or CUDA can't generate the exact same shader byte code.

CUDA is highly optimized toward its native platform and functionality. CUDA supports only one brand of GPU. You can't directly port that code to a foreign architecture and expect the same result or performance. There is no magic here; cross compilers aren't a new thing, and they come with heavy limitations, like no optimization at all and having to rewrite most of the advanced functionality from scratch.

In short, it isn't the same shader byte code, only one that is similar in result but not necessarily in performance.
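
To make that concrete, here's a rough sketch (a made-up kernel, not from Octane or any real code base) of the kind of NVidia-specific assumption that tends to be baked into CUDA code: the reduction below hard-codes the 32-lane warp and uses warp-shuffle intrinsics. A translator can make it run on GCN, whose wavefronts are 64 lanes wide, but it won't map to the same instruction sequence or necessarily the same efficiency.

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

// Sums an array using a butterfly reduction across each 32-lane warp.
// The "32" and the shuffle intrinsics are NVidia-shaped assumptions.
__global__ void warpSumKernel(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;

    // Halve the stride each step: 16, 8, 4, 2, 1 lanes.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    // Lane 0 of each warp now holds the partial sum of its 32 threads.
    if ((threadIdx.x & 31) == 0)
        atomicAdd(out, v);
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));   // fill with real data in practice
    cudaMemset(d_out, 0, sizeof(float));

    warpSumKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
[/CODE]
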
We still have problems converting text and spreadsheet documents, where the rules and specifications are well known and published, and yet you expect a third party to:

1- Have access to all the source code, including helper libraries, of the closed-source CUDA
2- Be able to replicate everything in another API that is on its last legs (OpenCL) and incomplete.
3- Be able to certify that their translation has a 1:1 relation and optimization to the original code.

And all this just to avoid giving their customers the option of using an NVidia card instead of an underperforming AMD one... A big waste of resources IMHO.
What if something that takes N+X cycles on the CUDA architecture takes N cycles on AMD GCN, in this particular case?

I addressed this in my last sentence... You'll get a gain in that case. But since the respective targeted architectures both have strengths and weaknesses, and in many cases totally different ways of doing things, most of the time your translation system won't be able to give you that 1:1 relation, which besides performance issues may also introduce differences in the results.
 
on its last legs (OpenCL) and incomplete.

I'm not really sure if "on its last legs" is the right choice of words:

wiki:

OpenCL 2.2
OpenCL 2.2 brings the OpenCL C++ kernel language into the core specification for significantly enhanced parallel programming productivity. It was released on 16 May 2017.

  • The OpenCL C++ kernel language is a static subset of the C++14 standard and includes classes, templates, lambda expressions, function overloads and many other constructs for generic and meta-programming.
  • Leverages the new Khronos SPIR-V 1.1 intermediate language which fully supports the OpenCL C++ kernel language.
  • OpenCL library functions can now take advantage of the C++ language to provide increased safety and reduced undefined behavior while accessing features such as atomics, iterators, images, samplers, pipes, and device queue built-in types and address spaces.
  • Pipe storage is a new device-side type in OpenCL 2.2 that is useful for FPGA implementations by making connectivity size and type known at compile time, enabling efficient device-scope communication between kernels.
  • OpenCL 2.2 also includes features for enhanced optimization of generated code: Applications can provide the value of specialization constant at SPIR-V compilation time, a new query can detect non-trivial constructors and destructors of program scope global objects, and user callbacks can be set at program release time.
  • Runs on any OpenCL 2.0-capable hardware (Only driver update required)
Future
[image caption: The International Workshop on OpenCL (IWOCL), held by the Khronos Group]
When releasing OpenCL version 2.2 the Khronos Group announced that OpenCL would be merging into Vulkan in the future.


---
But yes, it appears OpenCL as we know it will be going away.
 
I addressed this in my last sentence... You'll get a gain in that case. But since the respective targeted architectures both have strengths and weaknesses, and in many cases totally different ways of doing things, most of the time your translation system won't be able to give you that 1:1 relation, which besides performance issues may also introduce differences in the results.
You should add: the most important performance factor is the actual throughput of the cores, not theoretical TFLOPs ;).

If AMD GCN has higher core throughput than the CUDA architecture, you will see gains immediately. CUDA architecture features can mitigate this advantage. The same goes the other way around for AMD.

As for the cross compiler: the architectures are actually similar enough that HIP translates 99.96% of the code from CUDA directly to OpenCL. That 0.04% is your required optimization, if you want to get "all" out of your GPUs.

Of course, that 0.04% may be the most meaningful in terms of performance.
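
For anyone curious what that translation actually looks like, here's a small sketch (a made-up SAXPY, not anything from OTOY) with the HIP equivalents that a tool like hipify substitutes shown as comments. Note the kernel body itself passes through untouched - which is exactly why "translated" says nothing about "optimized for the target GPU".

[CODE]
#include <cuda_runtime.h>   // HIP: #include <hip/hip_runtime.h>

__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unchanged by the translation
    if (i < n) y[i] = a * x[i] + y[i];
}

void runSaxpy(int n, float a, const float* hx, float* hy)
{
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));                             // HIP: hipMalloc
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);  // HIP: hipMemcpy
    cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;                      // tuned with a 32-wide warp in mind;
    int grid  = (n + block - 1) / block;  // the remaining "0.04%" lives in choices like this
    saxpy<<<grid, block>>>(n, a, dx, dy); // HIP: hipLaunchKernelGGL(saxpy, grid, block, 0, 0, ...)

    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);                                                   // HIP: hipFree
    cudaFree(dy);
}
[/CODE]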
 
You should add: the most important performance factor is the actual throughput of the cores, not theoretical TFLOPs ;).

If AMD GCN has higher core throughput than the CUDA architecture, you will see gains immediately. CUDA architecture features can mitigate this advantage. The same goes the other way around for AMD.

As for the cross compiler: the architectures are actually similar enough that HIP translates 99.96% of the code from CUDA directly to OpenCL. That 0.04% is your required optimization, if you want to get "all" out of your GPUs.

Of course, that 0.04% may be the most meaningful in terms of performance.

And here lies the problem some have... "HIP is translating" does not equal "the byte code is the same, with the same exact performance". It just means that the code will run with the expected result. Optimization is a totally separate process from translation, so the code could be 99.96% "translated" and 0% optimized.

As for your first point:

"You should add: most important performance factor is actual throughput of the cores, not theoretical TFLOPs"

This is a lesson AMD hasn't learned yet, since they still try to sell their cards based on theoretical performance while being destroyed by the actual throughput performance of a year-old NVidia card.
 
And here lies the problem some have... "HIP is translating" does not equal "the byte code is the same, with the same exact performance". It just means that the code will run with the expected result. Optimization is a totally separate process from translation, so the code could be 99.96% "translated" and 0% optimized.

As for your first point:

"You should add: most important performance factor is actual throughput of the cores, not theoretical TFLOPs"

This is a lesson AMD hasn't learned yet, since they still try to sell their cards based on theoretical performance while being destroyed by the actual throughput performance of a year-old NVidia card.
So when an AMD card, with unoptimized drivers and in software not optimized for it, performs better than an Nvidia card with optimized software and drivers, we say AMD GPUs are worse. Good to know.

Back to topic. OTOY claims that the GPUs are treated as CUDA devices. If they have tested it on the M395X, I think it will behave like a GTX 980, because both GPUs have similar core counts, and I do not think OTOY did anything to change the register file handling in CUDA from 128 cores to 64 cores. That would create a huge change for AMD GPUs.

How come? Throughput would double in the execution pipeline. There are other features that also affect performance (store and load width of instructions, etc.), so in the end we would see a 40-50% performance increase clock for clock.

Let me explain how GPU architectures work, again.

With Kepler, Nvidia designed the GPU architecture to have 192 CUDA cores per 256 KB of register file.
With Maxwell they went from 192 cores to 128 cores per 256 KB of register file.

At first glance, it looks like a downgrade, right?

No. What this did was increase the throughput of the cores, because they were less starved for resources.
When we look at gaming and the idea of Tile Based Rasterization, people thought that TBR was what increased efficiency, but TBR did not actually increase throughput. It improved memory balancing, memory load handling, and culling capabilities. This increased geometry performance, but was capped by the throughput of the cores, which were able to do more work each cycle - increasing efficiency.

This is why the GTX 980, with 2048 CUDA cores and roughly 4.4 TFLOPs, was able to outperform the 5.5 TFLOPs GTX 780 Ti, which had 2880 CUDA cores.
[attached benchmark charts comparing the GTX 980 and GTX 780 Ti]

Consumer Pascal GPUs have not seen a per-clock increase in performance like Maxwell did over Kepler, because they still share the same Maxwell 128 core/256 KB RFS layout. GP100 is a proper 64 core/256 KB RFS architecture, which is why we have seen a per-clock and per-core increase in performance over previous generation GPUs.
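
To put quick numbers on those core-to-register-file ratios (just dividing out the figures above - a back-of-the-envelope sketch, not a benchmark, and ignoring how the register file is actually banked inside the SM/CU):

[CODE]
#include <cstdio>

int main()
{
    const int regs = 256 * 1024 / 4;   // a 256 KB register file is 65,536 32-bit registers
    const int cores[]     = { 192, 128, 64 };
    const char* layouts[] = { "Kepler SMX (192 cores)",
                              "Maxwell / consumer Pascal SM (128 cores)",
                              "GP100 SM / GCN CU (64 cores)" };
    for (int i = 0; i < 3; ++i)
        printf("%-42s ~%d registers per core\n", layouts[i], regs / cores[i]);
    // Roughly 341, 512 and 1024 registers per core: the fewer cores share one
    // register file, the more register space each core's threads get before
    // occupancy starts to drop - the "less starved for resources" argument.
    return 0;
}
[/CODE]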


How does it look on AMD GPUs? Since GCN1 they have always been 64 cores/256 KB of register file in each Compute Unit. Actually, I am just looking at the GCN white paper, and the register file size in GCN is flexible; it can vary from 128 KB to 256 KB. That is why they have had stronger throughput even in software not optimized for this architecture. That is why the AMD GCN architecture excels in the Apple ecosystem: not only is it faster per clock (though clocked lower because of this), it is also wider and able to do more work, if your software is properly optimized.

What this means is: even if OTOY does not optimize specifically for GCN (they cannot, for obvious reasons), the GPUs will not be worse than their direct competitors with similar core counts, at least from a core throughput perspective. There can be features that GCN does not have, and there is obviously the topic of the pipeline. Nvidia GPUs have a longer pipeline, which is why they are able to clock higher at lower power consumption. GCN has always had a shorter pipeline, because it was able to do more work each cycle. The tradeoff is that it could never clock very high and used more power, especially in non-optimized software.

In general we both agree about the topic of optimization. But it's way less work to do right now, with HIP, than it would be without it. It will be interesting to see the performance differences between GCN and CUDA hardware running CUDA code, but I don't expect a huge difference in performance between GPUs with similar core counts. If we compare, for example, the M395X, it will behave just like a 2048-core Maxwell or consumer Pascal GPU, unless performance relies only on the RFS and is not related to core count; then the AMD GPU should be much faster.
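
If anyone wants to sanity-check those per-SM/per-CU figures on their own hardware, the runtime reports them directly - a minimal query sketch, assuming the CUDA toolkit (HIP exposes the same information through hipGetDeviceProperties, where GCN parts report a wavefront size of 64):

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    printf("%s\n", prop.name);
    printf("  multiprocessors (SM/CU) : %d\n", prop.multiProcessorCount);
    printf("  warp/wavefront size     : %d\n", prop.warpSize);
    printf("  registers per SM/CU     : %d (%d KB)\n",
           prop.regsPerMultiprocessor, prop.regsPerMultiprocessor * 4 / 1024);
    printf("  shared mem per SM/CU    : %zu KB\n",
           prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}
[/CODE]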
 
So when an AMD card, with unoptimized drivers and in software not optimized for it, performs better than an Nvidia card with optimized software and drivers, we say AMD GPUs are worse. Good to know.

Back to topic. OTOY claims that the GPUs are treated as CUDA devices. If they have tested it on the M395X, I think it will behave like a GTX 980, because both GPUs have similar core counts, and I do not think OTOY did anything to change the register file handling in CUDA from 128 cores to 64 cores. That would create a huge change for AMD GPUs.

How come? Throughput would double in the execution pipeline. There are other features that also affect performance (store and load width of instructions, etc.), so in the end we would see a 40-50% performance increase clock for clock.

Let me explain how GPU architectures work, again.

With Kepler, Nvidia designed the GPU architecture to have 192 CUDA cores per 256 KB of register file.
With Maxwell they went from 192 cores to 128 cores per 256 KB of register file.

At first glance, it looks like a downgrade, right?

No. What this did was increase the throughput of the cores, because they were less starved for resources.
When we look at gaming and the idea of Tile Based Rasterization, people thought that TBR was what increased efficiency, but TBR did not actually increase throughput. It improved memory balancing, memory load handling, and culling capabilities. This increased geometry performance, but was capped by the throughput of the cores, which were able to do more work each cycle - increasing efficiency.

This is why the GTX 980, with 2048 CUDA cores and roughly 4.4 TFLOPs, was able to outperform the 5.5 TFLOPs GTX 780 Ti, which had 2880 CUDA cores.
[attached benchmark charts comparing the GTX 980 and GTX 780 Ti]

Consumer Pascal GPUs have not seen a per-clock increase in performance like Maxwell did over Kepler, because they still share the same Maxwell 128 core/256 KB RFS layout. GP100 is a proper 64 core/256 KB RFS architecture, which is why we have seen a per-clock and per-core increase in performance over previous generation GPUs.


How does it look on AMD GPUs? Since GCN1 they have always been 64 cores/256 KB of register file in each Compute Unit. Actually, I am just looking at the GCN white paper, and the register file size in GCN is flexible; it can vary from 128 KB to 256 KB. That is why they have had stronger throughput even in software not optimized for this architecture. That is why the AMD GCN architecture excels in the Apple ecosystem: not only is it faster per clock (though clocked lower because of this), it is also wider and able to do more work, if your software is properly optimized.

What this means is: even if OTOY does not optimize specifically for GCN (they cannot, for obvious reasons), the GPUs will not be worse than their direct competitors with similar core counts, at least from a core throughput perspective. There can be features that GCN does not have, and there is obviously the topic of the pipeline. Nvidia GPUs have a longer pipeline, which is why they are able to clock higher at lower power consumption. GCN has always had a shorter pipeline, because it was able to do more work each cycle. The tradeoff is that it could never clock very high and used more power, especially in non-optimized software.

In general we both agree about the topic of optimization. But it's way less work to do right now, with HIP, than it would be without it. It will be interesting to see the performance differences between GCN and CUDA hardware running CUDA code, but I don't expect a huge difference in performance between GPUs with similar core counts. If we compare, for example, the M395X, it will behave just like a 2048-core Maxwell or consumer Pascal GPU, unless performance relies only on the RFS and is not related to core count; then the AMD GPU should be much faster.

Again, you are missing the forest for the trees...
About 95% of that wall of text has nothing to do with what we are talking about.

And I went out of my way to say that it can play both ways, but as usual you feel the need to attack anyone with an opinion other than your own. I have told you to stop it 3 times already when addressing me. Next time I'll report it.

Both brands of GPU use totally different architectures, with different opcodes and registers. While they both display the same graphics on screen or give you the same results in a compute task, THEY ARE TOTALLY DIFFERENT INTERNALLY IN HOW THEY ACHIEVE THOSE RESULTS! If they weren't, then NVidia would sue the hell out of AMD, or vice versa.

And when higher-level API calls are compiled, they take into account the targeted architecture. If NVidia internally uses n steps to do a task and AMD uses n+x, then NVidia comes out on top. The reverse also applies. And that is also the point I was making vis-à-vis GoMac originally: the translated code isn't 1:1 with the original, because it can't be. It could only be if the code was running on the exact same architecture and using the same API, or if it was targeting an intermediate machine like the JVM or Microsoft IL.
 
taking your POV and trying to impose it as some sort of 'view of the pros' is entirely off base and out of touch.

I never said "all pros", I said there are pros out there that don't need the overly expensive Xeons and ECC. Thanks for putting words in my mouth.

And yes, frankensteining a computer together. Freaking all computers are frankensteined. Components come from a host of different manufacturers, regardless of who you buy your computer from. But yeah. Thanks for the note. ;)
 
Again, you are missing the forest for the trees...
About 95% of that wall of text has nothing to do with what we are talking about.

And I went out of my way to say that it can play both ways, but as usual you feel the need to attack anyone with an opinion other than your own. I have told you to stop it 3 times already when addressing me. Next time I'll report it.

Both brands of GPU use totally different architectures, with different opcodes and registers. While they both display the same graphics on screen or give you the same results in a compute task, THEY ARE TOTALLY DIFFERENT INTERNALLY IN HOW THEY ACHIEVE THOSE RESULTS! If they weren't, then NVidia would sue the hell out of AMD, or vice versa.

And when higher-level API calls are compiled, they take into account the targeted architecture. If NVidia internally uses n steps to do a task and AMD uses n+x, then NVidia comes out on top. The reverse also applies. And that is also the point I was making vis-à-vis GoMac originally: the translated code isn't 1:1 with the original, because it can't be. It could only be if the code was running on the exact same architecture and using the same API, or if it was targeting an intermediate machine like the JVM or Microsoft IL.
I have no idea where you see an attack there. I am just adding to what you have said. Why do you always have to believe that I am attacking anyone?

GPU architectures are quite similar on a fundamental level; they just achieve things in different ways and have different features, which may or may not be available from other vendors. It's exactly what you wrote, only in more detail, from a low-level perspective. Do you see an attack here?

I actually described how Nvidia gets increased throughput of the cores, which is what matters most for how a GPU and its pipelines function, ergo its performance. Features are what differentiate GPU architectures, and allow the performance to be increased.

CUDA code always has a set of instructions which are prefetched and defined by the register file size. It's always 256 KB per clock cycle. If you have an architecture with a different register file size, you have to optimize your code, because you will get a border conflict. If your architecture has the same 256 KB register file size per clock cycle, you will not see any difference and you will not get any conflict - which is exactly why OTOY is able to run CUDA code on AMD GPUs in the first place. The number of cores working on that 256 KB register file will always define your throughput and your GPU performance.

Yes, you always have to optimize. If you have an application that relies solely on compute throughput, even when running CUDA code, only the latest Nvidia GP100 and GV chips will be able to compete with GCN, because those GPU architectures share the 64 core/256 KB register file layout. The only question here is how the shorter GCN pipeline will affect CUDA code performance.
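
That register-file-versus-cores tradeoff is exactly what the occupancy API measures: how many blocks of a given kernel fit on one SM/CU, given the registers each thread ended up using. A minimal sketch with a made-up kernel (HIP offers an equivalent occupancy query); capping registers per thread with __launch_bounds__ is one of the per-target tweaks that a pure source translation will never do for you:

[CODE]
#include <cstdio>
#include <cuda_runtime.h>

// __launch_bounds__(256) tells the compiler to budget registers so that
// 256-thread blocks of this kernel stay resident.
__global__ void __launch_bounds__(256) heavyKernel(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = data[i], b = a * a, c = b * a, d = c * b;  // burns a handful of registers
        data[i] = a + b + c + d;
    }
}

int main()
{
    int blocksPerSM = 0;
    // How many 256-thread blocks of heavyKernel can be resident on one SM,
    // given its register and shared-memory footprint?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, heavyKernel, 256, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%d blocks x 256 threads resident per SM (device has %d SMs)\n",
           blocksPerSM, prop.multiProcessorCount);
    return 0;
}
[/CODE]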
 
CUDA is highly optimized toward its native platform and functionality.

Let me clarify:
I know CUDA somewhat well. I've never seen anything in it that is unique to Nvidia hardware. Maybe I'm missing something more advanced. If so, the people claiming that should provide proof of the specific functionality.

I think the other misconception going on here is about what Nvidia GPUs actually run: they don't run CUDA. They run Nvidia shader byte code. That's it. A CUDA app or compiler takes CUDA and compiles it into Nvidia shader byte code. The only reason CUDA can't run on AMD hardware is because no one wrote a CUDA compiler for AMD.

Now someone has written a CUDA compiler for AMD. The AMD CUDA compiler works the exact same way the Nvidia one does. There's no lost performance lurking anywhere. There shouldn't be any real difference in capabilities. All someone did was write the CUDA compiler for AMD that Nvidia refused to write.
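
To make the "it's just a compiler target" point concrete, the same trivial kernel can go through two completely independent back ends (illustrative commands only, assuming the CUDA toolkit and AMD's hipcc are installed; exact flags vary by version). Neither path involves hidden hardware hooks - each is simply a compiler emitting its own GPU's byte code from the same source:

[CODE]
//   nvcc  -arch=sm_52 -ptx   axpy.cu    // CUDA C++ -> PTX (NVIDIA's intermediate code)
//   nvcc  -arch=sm_52 -cubin axpy.cu    // CUDA C++ -> SASS in a cubin, the byte code the GPU runs
//   hipcc axpy.cpp                      // same source, after hipify -> AMD GCN ISA

__global__ void axpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
[/CODE]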

FWIW Koyoot sounds more familiar with GPU internals than I am, but everything he is saying also makes perfect sense for why there wouldn't be any hidden performance penalties lurking.

GPU compute languages are typically fairly generic. The only thing that makes CUDA unique is vendor lock in and the amount of tooling, existing code, and ease of use (and on ease of use I'm not sure Nvidia has that advantage any more.)

CUDA also supports other architectures like x86 already, making the missing AMD support even more clearly anticompetitive and less about special Nvidia hardware magic.
 
but just so we're clear-- do you think Windows should also let CUDA into their garden (your words)? or is CUDA support fine as is on Windows, and it's only Apple who should license nVidia's software and have it included in every macOS (and, I assume, iOS) install?

I cannot talk much about Windows since I don't care that much about that OS (for me it's always just been the best gaming platform and that's that). However, I don't think MS lately is as narrow-minded as Apple is (a mentality Apple tries to pass on to their users as well - no pun intended, generally speaking). They still support OpenGL and Vulkan along with their own DirectX, they've implemented a Linux subsystem, eGPUs were supported looong before Apple even mentioned them, etc. etc. They are not negative regarding choices; they understand what a computer is supposed to be.

So, again, I really don't get all the negativity in this thread regarding Nvidia and everything related to them (unless it is translated for AMD, in which case it's fine). If someone started reading this thread without any other knowledge, they would get the impression that AMD is the messiah of graphics for Macs.
Please don't close this thread for a fight over GPUs again... :D:cool::rolleyes:

I don't see any reason why the GPU talk is dangerous for the thread. GPUs are a very important part of any upcoming Mac. There will always be different opinions, but that's not a bad thing.
There is a choice. Apple can use either Nvidia or AMD GPUs. But there is no choice of API you can use on the Apple ecosystem - Metal 2.

Great news! So where can I buy a Mac with Nvidia? :p
 
I don't see any reason why the GPU talk is dangerous for the thread. GPUs are a very important part of any upcoming Mac. There will always be different opinions, but that's not a bad thing.

Evidently, you haven't been around much on the Forum. This is not the first thread on the Mac Pro.
I was just teasing of course, but given the past discussions... it wasn't without reason. :D

Confrontation is always nice, but trying to prevail in a competition that has no winner becomes pathological (and I am not referring to the present discussion!).

There is no GOOD card or BAD card in absolute terms. There are so many aspects that fit so many different use cases, plus there are political (market, contracts, TOS...) and financial reasons why a company such as Apple decides to go for one supplier or another (or both).

I follow with interest the reasoning on either side; I just hope it doesn't become a longer-dick competition rather than an informative confrontation.

Again... just teasing, given "our" past experiences here. ;)

@
 
Accepting only Metal 2 is in its own way, effectively, back to a Vendor Lock.

And while from a technological standpoint this example (Metal 2) may be utterly wonderful & desirable for all software vendors to transition to ... it isn't a free lunch. Implementation creates program management problems for all engaged parties (e.g., suppliers) - and that's not merely the cost, but also the schedule to implement.

And from my very casual monitoring of Apple's behavior over the decades, they have a history of introducing the 'next great thing', but getting impatient and often curtailing the legacy prematurely. While this does 'force' suppliers to respond to innovation, it also leaves customers in the lurch: it's counterproductive to be an early adopter because the toolbox is incomplete - and by the time the pieces are in place to effect a positive transition (with minimal disruption), the window is artificially narrow, which may not align with our own in-house production schedules & priorities, which have little tolerance for potential risk from workflow disruptions.

And this is not some unknown, profoundly new realization - it's the old "Better is the Enemy of Good Enough", and it is ultimately the Free Market which decides if "changes for the better" are really worth their asking price. Not Apple.

-hh
 
Accepting only Metal 2 is in its own way, effectively, back to a Vendor Lock.

And while from a technological standpoint this example (Metal 2) may be utterly wonderful & desirable for all software vendors to transition to ... it isn't a free lunch. Implementation creates program management problems for all engaged parties (e.g., suppliers) - and that's not merely the cost, but also the schedule to implement.

And from my very casual monitoring of Apple's behavior over the decades, they have a history of introducing the 'next great thing', but getting impatient and often curtailing the legacy prematurely. While this does 'force' suppliers to respond to innovation, it also leaves customers in the lurch: it's counterproductive to be an early adopter because the toolbox is incomplete - and by the time the pieces are in place to effect a positive transition (with minimal disruption), the window is artificially narrow, which may not align with our own in-house production schedules & priorities, which have little tolerance for potential risk from workflow disruptions.

And this is not some unknown, profoundly new realization - it's the old "Better is the Enemy of Good Enough", and it is ultimately the Free Market which decides if "changes for the better" are really worth their asking price. Not Apple.

-hh
thing is, they don't appear to be ditching (anytime soon) the standard which most of macOS and Mac software is based on... that being OpenGL.

the 'next best thing' in this case is something like Vulkan which is also new.

it appears that instead of adopting the next best thing and accepting all the Mac-specific traps they may fall into (similar has occurred with OpenGL), they'll write their own graphics API as a means of moving forward.

the story you tell of Apple's past behavior (in regards to graphics) could be OpenCL.. they pushed that pretty hard at first, but I believe they ran into walls early on regarding how Mac-specific the standard could become, or how much unnecessary baggage would come along with it, since it's written generically to accommodate all software/hardware instead of being optimized specifically for the uses Apple sees in their future.

---
saying it another way-- I personally wouldn't liken Metal to, say, ditching USB-A for USB-C on the laptops.

similar in ways, sure. but end users may end up with a huge benefit down the line regarding efficiency/speed/features (maybe)/Mac optimization.

(and if they totally blow it, there's always Windows to fall back on ;) )
 
Accepting only Metal 2 is in its own way, effectively, back to a Vendor Lock.

And while from a technological standpoint this example (Metal 2) may be utterly wonderful & desirable for all software vendors to transition to ... it isn't a free lunch. Implementation creates program management problems for all engaged parties (e.g., suppliers) - and that's not merely the cost, but also the schedule to implement.

And from my very casual monitoring of Apple's behavior over the decades, they have a history of introducing the 'next great thing', but getting impatient and often curtailing the legacy prematurely. While this does 'force' suppliers to respond to innovation, it also leaves customers in the lurch: it's counterproductive to be an early adopter because the toolbox is incomplete - and by the time the pieces are in place to effect a positive transition (with minimal disruption), the window is artificially narrow, which may not align with our own in-house production schedules & priorities, which have little tolerance for potential risk from workflow disruptions.

And this is not some unknown, profoundly new realization - it's the old "Better is the Enemy of Good Enough", and it is ultimately the Free Market which decides if "changes for the better" are really worth their asking price. Not Apple.

-hh
If you have the choice of using GPU compute with a proprietary API that gives you higher performance in a lower power and thermal envelope - would you not use it? Does that sound somewhat familiar?

Would you not go for this solution? Apparently only if it's Nvidia offering it. If Apple offers this on their ecosystem, and you get the best results on that ecosystem with an AMD GPU, everybody goes mad - even if it outperforms the comparable Nvidia GPU.

But what can I expect from a forum where professionals judge the usefulness of GPU compute based on gaming performance?
 
Evidently, you haven't been around much on the Forum. This is not the first thread on the Mac Pro.
I was just teasing of course, but given the past discussions... it wasn't without reason. :D

Confrontation is always nice, but trying to prevail in a competition that has no winner becomes pathological (and I am not referring to the present discussion!).

There is no GOOD card or BAD card in absolute terms. There are so many aspects that fit so many different use cases, plus there are political (market, contracts, TOS...) and financial reasons why a company such as Apple decides to go for one supplier or another (or both).

I follow with interest the reasoning on either side; I just hope it doesn't become a longer-dick competition rather than an informative confrontation.

Again... just teasing, given "our" past experiences here. ;)

@

So true, all of the above.
 
Since the iMac Pro is getting the same 10-bit, 500-nit panel as current iMacs, I suppose Apple will release an external 5K HDR, 1000-nit display for the mMP, one that is also compatible with the iMac Pro and anything that has TB3. And my guess is that it will have a built-in eGPU (three CTO options): Pro 580, Vega 56 and Vega 64.

This is part of the modular future of Pro Macs.
 
And it infuriates me to no end that Apple just assumes everyone needs a freakin' display, mouse, keyboard, mic, and camera, when I already have all those things and just need an updated tower/mini-tower to plug into my current display. :p

But when people spec out a machine for the same price as a 5K iMac, they can add a display, an 8-core CPU, a 1080 Ti (or Vega), all the peripherals, two M.2 drives, 32GB of RAM, etc. And if they were going to spend $5,000, it could get insane - considering not all pro users need a Xeon processor or ECC memory.

While I know that ECC corrects memory errors, does the non-ECC memory of the GPU ultimately rule on export/render?
 
Accepting only Metal 2 is in its own way, effectively, back to a Vendor Lock.
Not at all; you are only locked in by a few code optimizations. In my case, all my compute-offload code is written in C/C++ and then optimized for CUDA/OpenCL (later we will try Metal 2, or Vulkan->Metal 2).

What I like about working with CUDA (besides performance/efficiency) is its toolchain, which is a lot easier than working with OpenCL (actually I code/debug/optimize first on CUDA, then from that optimized/debugged CUDA code I build the OpenCL version, ...if required; most compute rigs use CUDA, with a few exceptions using AMD or Intel Xeon Phi).
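
A minimal sketch of that workflow (hypothetical names, not my actual code): keep the numeric core in plain C/C++ that compiles anywhere, wrap it in a thin CUDA kernel for the fast path, and keep a CPU reference around for debugging; the OpenCL version later reuses the same core expression.

[CODE]
#include <cuda_runtime.h>

// Plain C/C++ core: also usable from a CPU fallback, and easy to port into an OpenCL kernel.
__host__ __device__ inline float blend(float a, float b, float t)
{
    return a + t * (b - a);
}

__global__ void blendKernel(const float* a, const float* b, float* out, float t, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = blend(a[i], b[i], t);
}

// CPU reference path for debugging before touching either GPU toolchain.
void blendHost(const float* a, const float* b, float* out, float t, int n)
{
    for (int i = 0; i < n; ++i) out[i] = blend(a[i], b[i], t);
}
[/CODE]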
 