Does Apple have a chiplet design strategy at all?!
They pretty much have to. The fab processes will eventually force them into one. It is just a matter of how long they want to take to evolve into one.
Lots of folks have chirped about how Apple is going to integrate the cellular modem into the main SoC on the phone side. That actually isn't a particularly good path. Even chiplets (multiple dies in a package) on the phone side would be a better path for them to go down.
This is eye opening.
“Chiplets” seem like a very smart way of being able to improve “blocks” of an SoC in line with the most recent technology advancements, without waiting for a years-long complete redesign of the entire die.
The way Apple is doing things with UltraFusion won't necessarily completely decouple the dies from one another.
It is more akin to hooking the internal mesh/bus/communications of the dies to one another. It is going to be tough to radically redesign one internal mesh and leave the other stuck in the 'stone ages'.
Chiplets can help in that the company is doing fewer designs, so it can spend more time on that smaller number of designs. It helps with economics also. AMD is using the same compute chiplet in desktop Ryzen as it is in server, so that is fewer designs and more die reuse, but it isn't necessarily faster. AMD is on a steadier pace than Intel right now, but that is far more because they aren't trying to do crazy, too-large, catch-up-in-one-jump technology leaps. (AMD is closer to doing tick/tock now than Intel is. It is kind of like Intel threw tick/tock in the toilet and did the opposite.)
Chiplets done the right way might help Apple because they have historically been built to do a minimal number of designs well. Now that they are spread out over the Watch, phone, iPad, laptops, and a broad range of desktops, they appear to be struggling more (some may blame it on Covid-19, but I have my doubts; it contributes, though).
Chiplets don't lead to maximum perf/watt. If Apple is fanatical about perf/watt then they are likely going to do chiplets badly. They don't have to abandon perf/watt, but they need to back off just a bit to get the desktop lineup to scale well.
The internal blocks of a monolithic die can be made reasonably modular. If they have different clock/sleep/wake zones they are somewhat decoupled. I highly doubt there is some humongous technological slowdown if you do the functional decomposition the right way (chiplet or otherwise).
If the internal team dynamics and communications are horrible.... chiplets likely aren't going to help.
And if you use bleeding-edge 3D packaging to recombine the chiplet dies, that has its own design and production planning/scoping overhead as well.
I agree that Apple shouldn’t decouple the CPU, NPU, ML, GPU cores from the die unless — UNLESS — the performance and clock speeds of, say, GPU cores are being held back by the much slower clock speeds that CPU cores can only run at.
That is about backwards. The GPU is generally going to run at a slower clock than the CPU does. If the clocks of the CPU/GPU/NPU cores can be put to sleep to save power then the clocks are not hyper coupled to one another anyway.
Fair memory bandwidth sharing and thermal overload bleed from one zone to the next are more burdensome issues.
There is a whole set of folks on these forums who have their underwear all in a twist that Apple isn't hyper-focused on taking the single-threaded drag-racing crown away from Intel/AMD. I don't think those folks are going to be happy. The whole memory subsystem based on LPDDR5 isn't really set up to do that. The CPU clusters don't even have access to the entire memory bandwidth, so I'm not sure why you'd try to get a Top Fuel funny-car drag racer out of that.
Apple's bigger problem is keeping the core extremists happy: some only want 8 CPU cores and 800 GPU cores, others only want 64 CPU cores and just 32 GPU cores. The other thing is the shared L3 cache. Apple is probably more sensitive to that moving off onto another chiplet (at least if they want to keep similar perf/watt targets). That is awkward because SRAM isn't scaling, so the core/cache ratio is going to be tougher to keep if they're trying to hit the same cost/price zone.
If Apple could design a GPU-only IC that could run at 10 GHz — and the required I/O memory bus/data bus is sufficiently fast, I say, Go for it! Especially if more and more General Purpose instructions can be performed on a GPU instead of the CPU. (Apple needs to strive much harder to find more and more GPGPU optimizations. Linux is way ahead of Apple in this pursuit.)
How are you going to make the memory bus that fast with LPDDR memory as the basic building block?
GPUs don't run single-threaded hot-rod stuff well. They run massively and embarrassingly parallel data problems well. You don't need a few 10 GHz individual cores if you have 100x as many slower cores. That is the point. If you can't chop the problem up into a large number of smaller chunks then it probably shouldn't be on the GPU in the first place. That is a hammer driving a screw; wrong tool.
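To make the "chop it into chunks" point concrete, here is a minimal CPU-side sketch in plain Swift (no GPU API involved; concurrentPerform just stands in for "lots of slower cores", and the sizes are arbitrary). The first loop is the shape a GPU wants; the second cannot be chopped up at all.

```swift
import Foundation

// Embarrassingly parallel: every element is independent, so the work chops
// cleanly into a huge number of small chunks. This shape maps well onto
// "100x as many slower cores". (Real code would chunk the iterations; this
// is purely to show the shape of the problem.)
let input = (0..<1_000_000).map { Double($0) }
var squares = [Double](repeating: 0, count: input.count)
squares.withUnsafeMutableBufferPointer { out in
    DispatchQueue.concurrentPerform(iterations: input.count) { i in
        out[i] = input[i] * input[i]   // no iteration depends on any other
    }
}

// Inherently serial: each step needs the previous result, so it cannot be
// split into chunks. A few very fast cores win here; a GPU is the wrong tool.
var x = 2.0
for _ in 0..<1_000 {
    x = (x + 2.0 / x) / 2.0            // Newton's iteration toward sqrt(2)
}
```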
But I suspect Apple doesn’t want a situation where an Apple Silicon SoC design changes every six months, much to the confusion/frustration of developers and even Apple’s own OS/SDK engineering teams. It’s already the case that two years after the M1’s release, macOS software that claims it’s optimized to run on the M1 isn’t reeeeeeally as optimized to run on the M1 as it could be. (Including even Apple’s own core apps like Final Cut Pro.)
That is more so because Metal pushes more responsibility for optimization into the applications than OpenGL does. It is a double-edged sword. Sometimes app developers can squeeze out more performance than a bulky, heavyweight API could. But if you fix the bulky API, that fix is used by many apps, so the fixes roll out faster.
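A rough illustration of what "pushes more responsibility into the applications" means in practice, using nothing but the standard Metal setup calls (a minimal sketch, not tied to any particular app):

```swift
import Metal

// In Metal the application, not the driver, owns the device, the command
// queue, the command buffers, and the decision of when work gets submitted.
guard let device = MTLCreateSystemDefaultDevice(),
      let queue = device.makeCommandQueue(),
      let commandBuffer = queue.makeCommandBuffer() else {
    fatalError("Metal is not available on this machine")
}

// ... the app explicitly encodes its render/compute passes here, chooses
// how to batch them, and manages its own resource lifetimes ...

commandBuffer.commit()              // the app decides when to submit work
commandBuffer.waitUntilCompleted()  // and when (or whether) to block on it
```

With OpenGL most of that bookkeeping sat behind the driver, which is why a single driver fix could speed up many apps at once; with Metal each app has to get its own version of it right.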
A contributing factor to why Apple probably took AMD/Intel Metal off the table for macOS on M-series is that they don't want developers having to come up with their own time allocation to spend on AMD fixes versus Apple GPU fixes. If there is only one GPU to roll out fixes for, the allocation is mostly done (at least for the macOS-on-M-series side of the code; higher sales of new M-series Macs just make the Intel side less interesting, but Apple can't make everyone go down to exactly zero unless it is an Apple Silicon-only app).
If you think chiplets are going to make design/test/validate/deploy cycles go down to 6 months, I think you are just looking at the tip of the iceberg. Chiplets are not necessarily going to make things move that fast.
This is the "Mythical Man Month". A woman takes 9 moths to gestate a baby so if get 9 women can get a baby out in 1 month. No.
If you take a 100-billion-transistor monolithic chip and chop it into ten 10-billion-transistor chiplets, it isn't necessarily going to go 10x quicker. How that design is decomposed, and where the replicated portions are, matters.
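Back-of-the-envelope version of that, with a completely made-up assumption that 40% of the total effort is shared integration/validation/packaging work that every chiplet team still pays for (the number is purely illustrative):

```swift
import Foundation

// Amdahl-style schedule estimate for splitting one big design into chiplets.
// "sharedFraction" is the portion of the effort (integration, validation,
// packaging) that does not split across chiplet teams -- an assumed number.
func scheduleSpeedup(pieces: Double, sharedFraction: Double) -> Double {
    1.0 / (sharedFraction + (1.0 - sharedFraction) / pieces)
}

let speedup = scheduleSpeedup(pieces: 10, sharedFraction: 0.4)
print(String(format: "%.1fx, not 10x", speedup))   // prints "2.2x, not 10x"
```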
Frequent changes to Apple Silicon via regular “chiplet” improvements might present an ever ”moving target” that developers will be demotivated to code specifically for, knowing that a fundamental part of the architecture might change in 6 months.
For chiplets it might be more important than for a monolithic die that you measure twice and cut once. It isn't a mechanism for throwing out designs that are not validated and tested at a more rapid pace, or for frantically mutating opcodes. The external opcodes don't have to change to get performance improvements; CPUs/GPUs don't have to directly execute the exact same opcodes that programmers/compilers see.
I do realize that this is how it works already; I’m calling for even greater dedication on Apple’s part. Rosetta 2, for example, was Apple’s “Moon Shot” that Microsoft can only DREAM of ever accomplishing.
You mean throwing 32-bit apps out the window and telling your user base "tough luck, it is over"? Yeah, Microsoft can't do that. Microsoft spent years and years on an aspect of translation that Apple largely just punted on. Microsoft is going to keep supporting removable RAM and socketed CPUs too, largely because their user base is fundamentally different.
(As I understand it, Apple Rosetta 2 engineers even found plenty of Intel instructions that needed no translation at all to run on ARM.)
You mean like 2 + 2 = 4? Shocker. Basic math operations are not the source of the extremely difficult semantic-mismatch problems between instruction sets. "Store standard data value 1234 at memory location 678910" isn't a huge semantic-gap hurdle either.
There are easy and hard translations in all conversion problems.
If we're talking about exactly the same opcode binary encoding, that is odd, but if they're playing the same kind of trick at the decoder, perhaps not all that odd.
Rosetta 2 might have led to a lot of Apple engineer burnout, but Apple needs to find a way to motivate engineers to be as devoted and dedicated to “impossible” feats like this again.
Since there was a Rosetta 1, tagging Rosetta 2 as "impossible" isn't all that credible. What was true was that Apple was internally out of practice at getting something like that done. Apple didn't actually build the original Rosetta in-house, and many of the folks who did do the 68K -> PPC work weren't around anymore. Apple had JIT-compile skills internally, but this was a bit different. It wouldn't be surprising if it took many months longer than the initial project plan said it was going to take.