I'm not aware of any "secret sauce" functions in CUDA that would cause that to happen, especially with Metal 2 or OpenCL on Windows. I don't think there is any custom acceleration hardware that only CUDA has access to.
You might run into problems with Metal 1 or Apple's old OpenCL where you just couldn't do things and had to go a long way around, but Metal 2 has really beefed up the ability to write custom functionality at a low level, even if there isn't a 1:1 function map.
Short version: There isn't really any reason Metal or CUDA can't generate the exact same shader byte code.
CUDA is highly optimized toward its native platform and functionality. CUDA supports only one brand of GPU. You can't directly port that code to a foreign architecture and expect the same result or performance. There is no magic here: cross-compilers aren't a new thing, and they come with heavy limitations, like no optimization at all and having to rewrite most of the advanced functionality from scratch.
In short, it isn't the same shader byte code, only one similar in result but not necessarily in performance.
We still have problems converting text and spreadsheet documents, where the rules and specifications are well known and published, and yet you expect a third party to:
1- Have access to all the source code, including the helper libraries of the closed-source CUDA
2- Be able to replicate everything in another API (OpenCL) that is on its last legs and incomplete.
3- Be able to certify that their translation has a 1:1 relation, in behavior and optimization, to the original code.
And all this just to avoid giving their customers the option of using an Nvidia card instead of an underperforming AMD one... A big waste of resources IMHO.
What if something that takes N+X cycles on a CUDA architecture takes N cycles on AMD GCN, in this particular case?
I addressed this in my last sentence... You'll get a gain in that case. But since the respective target architectures both have strengths and weaknesses, and in many cases a totally different way of doing things, most of the time your translation system won't be able to give you that 1:1 relation, which besides performance issues may also introduce differences in the result.
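To give a rough idea of how a "difference in the result" can creep in, here is a minimal sketch in plain Python (not GPU code). It assumes, purely for illustration, that two backends end up adding the same floating-point values in a different order, which is one way a translated reduction could diverge from the original. The function names are made up for this example.

```python
# Hypothetical illustration: the same values summed in two different orders,
# standing in for two differently scheduled reductions. Floating-point
# addition is not associative, so the totals need not match bit for bit.
values = [0.1, 0.2, 0.3]

def sum_left_to_right(xs):
    total = 0.0
    for x in xs:
        total += x
    return total

def sum_right_to_left(xs):
    total = 0.0
    for x in reversed(xs):
        total += x
    return total

a = sum_left_to_right(values)
b = sum_right_to_left(values)
print(a == b)      # the two orders disagree in the last bits
print(abs(a - b))  # the gap is tiny, but it is not zero
```

The gap is tiny here, but across millions of accumulations per frame it is the sort of thing that can make two renders differ slightly.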