Very useful info. Thanks! Also, I would like to know about CUDA cores and Stream processors. They're essentially unified shaders, but I have read they cannot be compared apples to apples. I read on enthusiast websites that Nvidia's shaders are larger and more complex, so they do more per cycle. So if the AMD card has 5,000 shaders, the Nvidia card can achieve the same with a lot fewer shaders, is this correct? Someone did some conversion with Maxwell and said for every CUDA core it takes something like 2.25 Stream processors.
This is where things get unnecessarily complex, simply because of the marketing BS. All of the modern GPUs are essentially multiprocessor units. An Nvidia Pascal GPU is built from dual-core processors (what they call an SM), with two 256-bit vector ALUs per core. Similarly, AMD Vega is built from NCUs, which are processors with either 2 256-bit ALUs or 4 128-bit ones (the references I had on this were not clear). Blocks of these processors are then bundled together in groups sharing memory/cache/texture unit access. If CPUs were marketed the same way, then the Coffee Lake in my computer would be a "96 high performance FMA compute cores + 160 integer cores" or similar nonsense.
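To make the "marketed core count" point concrete, here is a tiny sketch of the arithmetic, using the Coffee Lake example from above (6 CPU cores, each with two 256-bit FMA units, where a 256-bit unit holds eight 32-bit lanes):

```python
# Sketch: how a "core count" in GPU-style marketing is really just a
# count of SIMD lanes. Numbers follow the Coffee Lake example in the post.

def marketed_core_count(processors: int, alus_per_processor: int,
                        alu_width_bits: int, element_bits: int = 32) -> int:
    """Count every SIMD lane of every ALU as a separate 'core'."""
    lanes_per_alu = alu_width_bits // element_bits
    return processors * alus_per_processor * lanes_per_alu

# A 6-core CPU with two 256-bit FMA units per core:
print(marketed_core_count(6, 2, 256))  # -> 96 "FMA compute cores"
```

The same formula applied to a GPU's SM/NCU count, ALUs per processor, and ALU width is where the big "shader" numbers on spec sheets come from.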
I have no idea if the capabilities of the ALUs are exactly the same these days, I assume they are very similar at least. Vega boasts this thing they call "rapid packed math", which essentially means that its ALUs can process vectors of different types without performance penalty — something that Intel GPUs for example have supported for ages. If I understand correctly Pascal has the same capability, so there is probably no difference here. At any rate, a "core" (meaning one ALU lane) on both Vega and Pascal is capable of 2 32-bit floating-point operations per cycle, but that number is a bit of a hoax, since it simply reflects the fact that they can do an addition and a multiplication in a single cycle (fused multiply-add or FMA). If all you do is additions or multiplications, it's only 1 FLOP per cycle.
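The FMA double-counting behind peak-FLOPS figures can be sketched like this (the lane count and clock below are made-up illustrative numbers, not any specific card):

```python
# Sketch of how peak FLOPS figures are computed, and why FMA counting
# doubles them. The 4096 lanes / 1.5 GHz figures are hypothetical.

def peak_gflops(fp32_lanes: int, clock_ghz: float,
                flops_per_lane_per_cycle: int = 2) -> float:
    # flops_per_lane_per_cycle = 2 assumes every operation is a
    # fused multiply-add (one add + one multiply per cycle).
    return fp32_lanes * clock_ghz * flops_per_lane_per_cycle

lanes, clock = 4096, 1.5
print(peak_gflops(lanes, clock))      # FMA-counted peak: 12288.0 GFLOPS
print(peak_gflops(lanes, clock, 1))   # pure adds or pure muls: 6144.0 GFLOPS
```

If your workload is not FMA-shaped, you only ever see the second number.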
What is quite important with GPUs though is the granularity of execution. The thing is, every ALU can do only one thing at a time. You can't have one part of an ALU adding two numbers and another part multiplying them. These are systems that are designed to process a lot of data at the same time using the same operation. This is great for graphics, where you usually want to apply the same operation in parallel (say, blend two images together). But this is also where the granularity matters. A GPU might have very fast 1024-bit ALUs capable of processing 32 FP numbers at the same time, but if you don't have that many numbers to process, parts of the ALU simply won't do anything useful. If you want to add 16 numbers and then subtract 16 other numbers, a single 1024-bit ALU would need two passes (two cycles). Two 512-bit ALUs instead will need only one cycle.
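A rough model of that 16+16 example, under the simplifying assumptions that different operations can never share an ALU pass and that passes distribute evenly across ALUs:

```python
import math

# Sketch of the granularity argument: cycles needed to run several
# *different* vector operations, given the number of ALUs and the
# number of 32-bit lanes per ALU. Simplified model: distinct operations
# can't share an ALU pass, and passes spread evenly over the ALUs.

def cycles_needed(op_sizes: list[int], num_alus: int, lanes_per_alu: int) -> int:
    passes = sum(math.ceil(n / lanes_per_alu) for n in op_sizes)
    return math.ceil(passes / num_alus)

ops = [16, 16]  # add 16 numbers, then subtract 16 other numbers
print(cycles_needed(ops, num_alus=1, lanes_per_alu=32))  # one 1024-bit ALU: 2
print(cycles_needed(ops, num_alus=2, lanes_per_alu=16))  # two 512-bit ALUs: 1
```

Same total lane count in both configurations, but the narrower ALUs waste nothing on this workload.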
From what I gather, Vega (and also Intel GPUs) are more flexible in how they do scheduling compared to Nvidia's architecture, which in turn allows them to pack work from multiple smaller, complex tasks more efficiently. This doesn't matter much for games or any other work where you are basically processing large arrays in parallel, but it makes a difference when you need to run less predictable code. This is also why Vega performs so well in complex compute tasks such as raytracing. But I admit that my knowledge on the topic is very basic.
I'm astonished that the new Vega graphics will be 60% faster than the 560X - can anyone explain how this is suddenly possible despite the strict power and thermal limits in the MBP?
We had some discussion on this on the previous pages. Essentially, a more efficient new architecture + HBM2 RAM that consumes significantly less power than GDDR5 while offering much higher bandwidth. The latter in turn means the GPU can be clocked higher to "compensate" for those power savings.
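The "memory power savings buy core clock" argument can be sketched with back-of-the-envelope numbers. Every wattage below is made up purely for illustration; the real budgets and memory power draws are not public:

```python
# Back-of-the-envelope sketch of the HBM2-vs-GDDR5 power argument.
# ALL numbers here are hypothetical, chosen only to show the mechanism.

total_budget_w = 35.0   # hypothetical dGPU package power limit in a laptop
gddr5_power_w = 10.0    # hypothetical GDDR5 memory subsystem draw
hbm2_power_w = 5.0      # hypothetical HBM2 draw at similar/higher bandwidth

old_core_budget = total_budget_w - gddr5_power_w   # 25.0 W left for the core
new_core_budget = total_budget_w - hbm2_power_w    # 30.0 W left for the core

# To first order (fixed voltage), dynamic power scales with frequency,
# so extra core headroom translates roughly into extra clock:
headroom = new_core_budget / old_core_budget
print(f"~{(headroom - 1) * 100:.0f}% more core power budget")  # ~20%
```

With the architectural improvements on top of that headroom, a large generational jump inside the same thermal envelope stops looking mysterious.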