Not quite sure how big exactly the GPU is from the "1 Petaflop of FP4 compute". I know they're marketing it for AI purposes, but I'd be very curious about its standard FP32 TFLOPs with that much RAM.

Hopefully more to come from Nvidia this year ... the CPU on this thing won't be terribly competitive, but the GPU could be ... and it could point to more exciting things for the long-rumored standard desktop/laptop ARM SoC Nvidia is said to be shipping later this year.*

(As for the Apple server chip stuff, my strong suspicion - perhaps wrong of course - is that there will continue to be significant overlap between the chips Apple uses internally for its servers and those it sells as Ultras to its customers)

*Edit: CPU may be better than I thought. Originally I had read it would be Neoverse V2 cores like the big Blackwell superchip. However, apparently it will be 10 X925 and 10 A725. Those should be much, much better!


I don't know if they will be competitive with the M4 Max - certainly not in ST unless heavily upclocked, and it isn't clear to me that they can be. According to Anandtech, the max clock speed may be 3.8 GHz:




That may not be a hard limit, and even so, it's much more exciting than I had originally thought! The ST performance will be okay, and with 20 cores it should have decent multicore performance, especially since the focus should be on the GPU. Still though, Nvidia and everyone else making the Digits announcement is being a little too coy with the GPU's specifications for my liking. Hopefully, they're good.
It comes with 128GB LPDDR5X (unified memory) for $3,000.00 which once again makes Apple's pricing look ridiculous in comparison.

The same from Apple is going to cost $4,799.00 in the Mac Studio or $4,699.00 from the MacBook Pro.
 
It comes with 128GB LPDDR5X (unified memory) for $3,000.00 which once again makes Apple's pricing look ridiculous in comparison.

The same from Apple is going to cost $4,799.00 in the Mac Studio or $4,699.00 from the MacBook Pro.

It also comes with a 4TB SSD, so that would be a base M2 Ultra Mac Studio with upgraded RAM & SSD (128GB RAM/4TB SSD) for US$5,799...
 
It comes with 128GB LPDDR5X (unified memory) for $3,000.00 which once again makes Apple's pricing look ridiculous in comparison.

The same from Apple is going to cost $4,799.00 in the Mac Studio or $4,699.00 from the MacBook Pro.
I would slow down on that criticism just a bit. The 128GB of RAM will probably not cost that much for the Studio once the M4 comes out - I take it you took the M2 Studio 128GB price? Remember, Apple makes you upgrade to the Ultra to get that in the M2 generation and won't for the M4 generation, as the M4 Max goes up to 128GB, which the M2 Max did not. So yes, the M2 Studio is a bad buy right now for many, many reasons. No question.

And the MacBook Pro is a full laptop with a screen. I mean, the mini costs $800 to get the same specs as the starting $1,600 price of the 14" MBP, and the M2 Studio was $900 - $1,100 cheaper than the base M2 Max laptop when they were both new (again, if you want to match SSDs). Of course we don't know when the M4 Studio will be out - probably summer, a month or so after DIGITS releases in May.

Of course the M4 Studio will doubtlessly still be more expensive than $3K to get 128GB of RAM, never mind the 4TB SSD. I'd say $3,700 - $4K depending (again, not counting the hard drive). That said, we also don't know the performance specs of this machine. The DIGITS CPU is almost certainly worse than the M4 Max by a fair margin (though again, not as bad as I thought it was). But the star of the show will be the GPU. Given the size of the device in pictures, I'm thinking thermal output won't be that high, so the GPU is probably not that big, especially since they didn't explicitly say how powerful it was - just the FP4 sparse metric.

So it's likely going to be mostly an AI-focused device with high memory bandwidth and high memory capacity rather than lots of compute - thus 128GB, but on a machine with a much more limited purpose than a Studio. Think more M4 Pro mini levels of performance, but with 128GB of RAM available and probably much better bandwidth. If I am right, take the M4 Pro mini with 64GB and tack on $800 for the extra 64GB (what Apple charges on the M4 Max to go from 64GB to 128GB) and you get ... $3,000 (albeit with a 512GB hard drive). But I could be wrong, so I'm definitely looking forward to seeing it. Hell, I might even be interested in it even if I'm right (though I would hold off to see what else comes down the pike from Nvidia).

Also, if anything this is a reaction to Apple's moves in the laptop and desktop space. As bad as we think Apple's upgrade RAM pricing is, it's nothing compared to Nvidia's GPU VRAM pricing past 24GB (now 32GB in the 5090). Prior to DIGITS, Nvidia's professional line of GPUs was the only way to get more, and it would often cost double the price of the Mac just for the GPU. The cheapest current Nvidia GPU I know of with 48GB of RAM (and it is only 48) is the RTX 6000 (the 5880 is China-only) and it sells for $6,800 (although that's MSRP; I don't know about prices at retail). And its compute is only a little better than a 24GB 4090 (whose MSRP is $1,500, but it's almost always more than that at retail). That makes Apple look fantastic by comparison.

That's why we saw a lot of machine learning folks, mostly interested in inference, picking up Apple devices, which had much better memory capacity and bandwidth than what they could get from a comparably priced Nvidia machine, even if they couldn't do the training on them. That Nvidia would see this too and react is one reason I thought Apple would push harder to incorporate greater matrix multiplication capability into its GPUs earlier. They may yet still, but the longer they wait, the less of a splash it'll be.

It also comes with a 4TB SSD, so that would be a base M2 Ultra Mac Studio with upgraded RAM & SSD (128GB RAM/4TB SSD) for US$5,799...

I mean yeah.
 
Right, and for $6k you can natively link 2 of them, run Linux, get 256GB of memory, 8TB of storage total, and all of the Nvidia developer tools that most researchers are using anyhow.

Even buying them to remote into is not the worst thing for that type of development and I think it will really hurt Apple on the highest-end for on-premises inference work where there are a few companies who are currently tuning models with M2 Ultras.

If the Hidra is real, I definitely expect Apple to ship it now; they need to get 512GB out in 2025 if possible, even if it costs $10,000. Competition is great.

I watched their keynote and think they have been working on this for a while – they are Apple’s largest competitor right now and into the next few years IMO. I see why Apple has been working with them again on the research end of things, particularly with Vision Pro featured a couple times.
 
Right, and for $6k you can natively link 2 of them, run Linux, get 256GB of memory, 8TB of storage total, and all of the Nvidia developer tools that most researchers are using anyhow.

Even buying them to remote into is not the worst thing and I think it will really hurt Apple on the highest-end for on-premises work for the few companies who are currently tuning models with M2 Ultras. If the Hidra is real, I definitely expect Apple to ship it now; they need to get 512GB out in 2025 if possible, even if it costs $10,000. Competition is great.

I watched their keynote and think they have been working on this for a while – they are Apple's largest threat especially
I'd turn that around. Nvidia in this space is the market leader. Apple may be a potential threat to Nvidia's dominance (though not in big iron). But it's Nvidia who rules the roost. So you could say that Apple is under pressure to keep being a threat, but it's they who are putting Nvidia under pressure to put out new devices and even putting them under pressure value-wise ... which almost feels dirty to say. :)
 
I'd turn that around. Nvidia in this space is the market leader. Apple may be a potential threat to Nvidia's dominance (though not in big iron). But it's Nvidia who rules the roost. So you could say that Apple is under pressure to keep being a threat, but it's they who are putting Nvidia under pressure to put out new devices and even putting them under pressure value-wise ... which almost feels dirty to say. :)

Yeah this framing makes more sense. I hope Apple responds with better RAM/storage pricing though. It’s been ridiculous for too long
 
Not quite sure how big exactly the GPU is from the "1 Petaflop of FP4 compute". I know they're marketing it for AI purposes, but I'd be very curious about its standard FP32 TFLOPs with that much RAM.
Nvidia's newly announced GPUs are marketed with "AI TOPS". If that's the same quantity, the Digits GPU should be comparable to a desktop 5070 or a laptop 5070 Ti. Those come with 6144 and 5888 CUDA cores, which should be equivalent to 48 and 46 of Apple's GPU cores.
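Just to show the arithmetic behind that equivalence - a minimal sketch using the CUDA core counts quoted above and assuming the commonly cited 128 FP32 ALUs per Apple GPU core (that per-core figure is my assumption, not something either vendor publishes for this comparison):

```python
# Rough CUDA-core to Apple-GPU-core equivalence, assuming 128 FP32 ALUs
# per Apple GPU core (an assumption, not an official figure).
ALUS_PER_APPLE_CORE = 128

for name, cuda_cores in [("desktop 5070", 6144), ("laptop 5070 Ti", 5888)]:
    apple_equiv = cuda_cores / ALUS_PER_APPLE_CORE
    print(f"{name}: {cuda_cores} CUDA cores ≈ {apple_equiv:.0f} Apple GPU cores")
# desktop 5070: 6144 CUDA cores ≈ 48 Apple GPU cores
# laptop 5070 Ti: 5888 CUDA cores ≈ 46 Apple GPU cores
```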
 
I'd turn that around. Nvidia in this space is the market leader. Apple may be a potential threat to Nvidia's dominance (though not in big iron). But it's Nvidia who rules the roost. So you could say that Apple is under pressure to keep being a threat, but it's they who are putting Nvidia under pressure to put out new devices and even putting them under pressure value-wise ... which almost feels dirty to say. :)

Lotsa people who seem to think compute is just one undifferentiated pool of sameness, like buying a heater...

Digits is NOT a consumer product. It doesn't ship with a consumer OS. It will be valuable to people who want that sort of workstation, but it will be competing against Linux boxes that used to hold a Xeon, not against Macs.
It shows the inexorable advance of ARM, and it's bad news for x86. But it's kinda irrelevant for Apple.
 
Nvidia's newly announced GPUs are marketed with "AI TOPS". If that's the same quantity, the Digits GPU should be comparable to a desktop 5070 or a laptop 5070 Ti. Those come with 6144 and 5888 CUDA cores, which should be equivalent to 48 and 46 of Apple's GPU cores.
It is and it isn't. That's why they specifically said sparse FP4, not TOPS. Nvidia has used FP4 to market gains in TOPS, but as far as I can tell they aren't doing that for the consumer-level hardware. And this isn't consumer-level hardware. To give some context, we can see the data sheets here:


So the B100 has 14 petaflops of sparse FP4 compute and the B200 has 18. They have 60 and 80 TFLOPs of FP32 compute, respectively. The GB200 is 2 GPUs and has 40 PFLOPs of FP4 and 180 TFLOPs of FP32. I think the GB10 GPU in DIGITS is from the same family of GPUs, but even if it's Blackwell 2 (like the upcoming consumer GPUs), I don't think that changes much.

Based on the above, we would expect the GB10 to have 1/14 the compute capacity of the B100, 1/18 of the B200, and 1/40th of the GB200. For all of them, that works out to roughly 4.5 FP32 TFLOPs. That's only about base M4 GPU level. But it's possible that the relationship between FP4 and FP32 isn't the same in the B10 (i.e. some matrix units are disabled so there's more FP32 per sparse FP4), or I've done something wrong.
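To make that scaling explicit, here's a minimal sketch using the data-sheet numbers above; the big assumption is that the GB10 keeps the data-center FP4:FP32 ratio:

```python
# Scale each data-center Blackwell part down to the GB10's 1 PFLOP of sparse FP4,
# assuming the FP4:FP32 ratio carries over (a big assumption).
reference_parts = {
    # name: (sparse FP4 in PFLOPs, FP32 in TFLOPs)
    "B100": (14, 60),
    "B200": (18, 80),
    "GB200 (2 GPUs)": (40, 180),
}
GB10_FP4_PFLOPS = 1.0

for name, (fp4_pflops, fp32_tflops) in reference_parts.items():
    estimate = fp32_tflops * (GB10_FP4_PFLOPS / fp4_pflops)
    print(f"Scaling from {name}: ~{estimate:.1f} FP32 TFLOPs")
# Scaling from B100: ~4.3 FP32 TFLOPs
# Scaling from B200: ~4.4 FP32 TFLOPs
# Scaling from GB200 (2 GPUs): ~4.5 FP32 TFLOPs
```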

But even if I have, you can see in the pictures that this is a very small device. It isn't going to be fitting a Max-level GPU. That's probably going to be an M4 Pro-level GPU at best. It is not meant for large compute projects; it IS meant as a development and inference tool with a huge VRAM pool. Because again, for inference, bandwidth and memory capacity matter a lot more than compute.

Lotsa people who seem to think compute is just one undifferentiated pool of sameness, like buying a heater...

Digits is NOT a consumer product. It doesn't ship with a consumer OS. It will be valuable to people who want that sort of workstation, but it will be competing against Linux boxes that used to hold a Xeon, not against Macs.
It shows the inexorable advance of ARM, and it's bad news for x86. But it's kinda irrelevant for Apple.

I'm not even really sure it competes against that, to be honest - if I'm doing my math right, DIGITS is pretty small. Although I guess back in the day the smallest Xeons weren't very big either, so that point still holds, I suppose. But you're absolutely right about the focus of this device versus the Studio. The Studio is a general-purpose consumer/"prosumer" product, and based on its hardware and software, DIGITS is definitely not. DIGITS is in some ways a souped-up Jetson AGX Orin DevKit (with a potentially smaller but more advanced GPU, a much bigger and better CPU, double the memory pool, actual storage, etc ...).

That said, I would still say that it isn't irrelevant to Apple, because Apple actually posed a threat here. Consumer/prosumer hardware like Apple ships is always a threat to professional hardware because it tends to be cheaper, more familiar, and more flexible. If it can (I'm making the numbers up) do 70% of the same job for 40% of the cost (and not be a unitasker, being capable of doing other things) ... suddenly people are buying that commodity hardware instead of the professional one, and the cheaper consumer hardware begins to replace the more expensive professional hardware. Now Apple would probably balk at being called commodity hardware, but there's no question in my mind that DIGITS is a direct response to all these ML people suddenly going "hey, you know what's kind of shocking? These Maxes and Pros from Apple are actually really good and cost effective for inference! And it can just be an everyday computer as well!"

Even AMD, with its M4 competitor Strix Halo*, was making that point in its marketing: a chip that goes up to 128GB of RAM could also outpace Nvidia consumer hardware and challenge professional hardware in inference when the model size outstripped Nvidia's memory capacity.

This device to me is thus a reflection that Nvidia needed a relatively cheap inference box (that was bigger/more capable than the Jetsons) to compete against consumer hardware that was in danger of pulling people out of the Nvidia/CUDA ecosystem. Apple had a nice system for that and had carved itself out a niche market for LLM inference so I suppose one could say "Nvidia threatens Apple's LLM market niche and thus its ability to threaten Nvidia in LLM-space" but I'd prefer to think of it as staving off Apple and consumer hardware in general.

=============

*AMD's marketing is definitely comparing with the M4 Pro, and at both the CPU and GPU level I'm thinking that's probably indeed where the performance will be at for Strix Halo, based both on leaked and advertised benchmarks for the CPU and on node/core count for the GPU relative to the integrated 890M and the 890M's performance profile. The Apple machine will have several advantages: better CPU cores with better efficiency and ST performance, and GPU cores on a better node with better efficiency and better RT (Strix Halo is RDNA 3.5, not 4), but obviously many games will be optimized for Windows PCs and so will be better there. And the M4 Pro caps out at 64GB of RAM while, like DIGITS, the AMD chip goes to 128GB. Bandwidth is the same as the M4 Pro, or similar enough depending on what LPDDR5X memory modules they go with (both have a 256-bit bus). Obviously pricing is not known yet, and it'll be fascinating to see how AMD/OEMs price their RAM upgrades above 32GB of RAM. This will be the first time we'll see any consumer unified SoC hardware do that in direct comparison to Apple (it won't be on-package, but still).

Yeah this framing makes more sense. I hope Apple responds with better RAM/storage pricing though. It’s been ridiculous for too long
 
But even if I have, you can see in the pictures that this is a very small device. It isn't going to be fitting a Max-level GPU.
It doesn't look much smaller than a desktop GPU, once you remove the cooling system. You can probably fit any GPU in that space. The question is just how big a chip still makes sense if you are limited to 80-100 W.
 
It doesn't look much smaller than a desktop GPU, once you remove the cooling system. You can probably fit any GPU in that space.

It's being described as "palm-sized" and while that may be a slight exaggeration, I've seen pictures of Jensen holding it easily in one hand - it's basically new-Mac-mini-sized, if not smaller. The cooling system is rather important for running GPUs. You don't see many desktop GPUs being run in a case that size without liquid cooling - even the bigger laptop GPUs can struggle. So, no, in practical terms you cannot just fit "any" GPU into that space.

The question is just how big a chip still makes sense if you are limited to 80-100 W.
Yeah ... even the big mobile chips have TGPs of over 100W.

As far as I can tell, the industry has settled on INT8 for "AI TOPs", and Nvidia appears to be using that as well (though at least one website, Tom's below, was confused and put INT8 on one page and FP4 on another). However, INT8 makes the most sense given what I was seeing from previous Blackwell designs - though it isn't clear if it's sparse or dense (Tom's is assuming sparse).




Even so, Nvidia's consumer chips appear to have a different ratio of TOPs to FLOPs than their professional counterparts, which I'm not entirely sure how to explain since it's all the same tensor cores and tensor core ratios (it could be different clock speeds when the tensor cores are running). So could it be as high as a 16 FP32 TFLOP GPU? (That number is derived from assuming the 5090's "AI TOPS" figure is sparse FP8, so sparse FP4 would be double, and "1 PFLOP sparse* FP4" would be ~(1/6.6) × 104 ≈ 16 TFLOPs.) Basically M4 Max-sized? Maybe. But that's about as power-hungry a chip as I would put in an enclosure that size. If the numbers reported for the 5090 are dense AI TOPS, halve again, so about an 8 TFLOP GPU. If it's built with the same FLOP-to-TOP ratio as the professional GPUs, almost halve again and it's just under 4.5 TFLOPs.

*Again, I'm unsure if this is sparse or dense, but everyone is quoting it as sparse - Nvidia's marketing doesn't say explicitly, but maybe the presentation did.
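Laying out those scenarios, a quick sketch of the estimates under each assumption (the 3352 AI TOPS and 104.8 FP32 TFLOPs for the 5090 are Nvidia's published figures; everything about the GB10 here is assumption):

```python
# Estimate GB10 FP32 throughput from its "1 PFLOP sparse FP4" claim under
# different assumptions about what the 5090's "AI TOPS" figure means.
# 5090 reference numbers (Nvidia specs): 3352 AI TOPS, 104.8 FP32 TFLOPs.
RTX_5090_AI_TOPS = 3352
RTX_5090_FP32_TFLOPS = 104.8
GB10_SPARSE_FP4_TFLOPS = 1000  # "1 PFLOP" of sparse FP4

scenarios = {
    # scenario: 5090's implied sparse-FP4 throughput in TFLOPs
    "5090 AI TOPS are sparse FP8/INT8": RTX_5090_AI_TOPS * 2,  # FP4 doubles it
    "5090 AI TOPS are dense FP8/INT8": RTX_5090_AI_TOPS * 4,   # sparsity (2x) + FP4 (2x)
}
for name, implied_5090_fp4 in scenarios.items():
    scale = GB10_SPARSE_FP4_TFLOPS / implied_5090_fp4
    print(f"{name}: GB10 ≈ {scale * RTX_5090_FP32_TFLOPS:.1f} FP32 TFLOPs")
# 5090 AI TOPS are sparse FP8/INT8: GB10 ≈ 15.6 FP32 TFLOPs
# 5090 AI TOPS are dense FP8/INT8: GB10 ≈ 7.8 FP32 TFLOPs
```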
 
Yeah ... even the big mobile chips have TGPs of over 100W.
TGP is a configurable option. You can have a mobile 5090 using 95 W or a mobile 5070 using 100 W. Or in the current generation, 80 W for the mobile 4090 and 115 W for the mobile 4050. They can use the power budget for running a big chip at a low clock rate or a small chip at a higher rate. The former gives them better performance but wastes die area, which is a scarce resource for them.

The power ranges for the mobile 5070 and 5070 Ti start at 50 W and 60 W, which sounds appropriate for a product like this. They may choose to use a smaller GPU, if they think that higher profit margins justify intentionally underwhelming products.
 
TGP is a configurable option. You can have a mobile 5090 using 95 W or a mobile 5070 using 100 W. Or in the current generation, 80 W for the mobile 4090 and 115 W for the mobile 4050. They can use the power budget for running a big chip at a low clock rate or a small chip at a higher rate. The former gives them better performance but wastes die area, which is a scarce resource for them.

The power ranges for the mobile 5070 and 5070 Ti start at 50 W and 60 W, which sounds appropriate for a product like this. They may choose to use a smaller GPU, if they think that higher profit margins justify intentionally underwhelming products.
But that's why I am talking in terms of performance, not core count. Notice all my numbers are in TFLOPs, and I am estimating them based on the known TOPs and TFLOPs of current-generation Blackwell consumer and professional GPUs. Yes, in theory, within a given thermal envelope you can get higher TFLOPs by sticking in a larger GPU and underclocking it. But you can't then quote TFLOPs for the max clocks and give TDP for the base clock. Further, the TOPs figure is known (sort of) and therefore anchors the resulting estimates of TFLOPs.

As an aside, there are times where Nvidia has played fast and loose with TDP. For instance, the L4 is quoted as having 30 TFLOPs at 72W, while the 5080 mobile (same die, lower performance) needs 110W with fewer cores and a smaller boost clock. Either the L4 is massively well binned or 72W is not its actual max TDP (GPUs tend to be better about this than CPUs). Nvidia repeats this number all over the place, so I'm not sure what to make of that, since no other AD104 GPU can match that thermal profile. Anyway, back to the main topic:

Your original assumption was that AI TOPs was the same as the FP4 metric. This is what you wrote originally:

Nvidia's newly announced GPUs are marketed with "AI TOPS". If that's the same quantity, the Digits GPU should be comparable to a desktop 5070 or a laptop 5070 Ti. Those come with 6144 and 5888 CUDA cores, which should be equivalent to 48 and 46 of Apple's GPU cores.

The only way to do that is to have the quoted DIGITS PFLOPs be for dense FP4 and the AI TOPs be for sparse INT8, AND for the B10 to not have the same ratio of TOPs to FLOPs as the other professional chips (the last one is possible, since it could be Blackwell 2 "Tesla" to Blackwell 1 as the Ada "Tesla" line was to Hopper). Otherwise it has to be much smaller - at least half the size, if not more. That said, in fairness, that would be quite small for the normal "Tesla" line of GPUs, as I don't see any in the last couple of generations that small.
 
TGP is a configurable option.

Unlike desktop Nvidia GPUs, which have a user-configurable power limit, mobile Nvidia GPUs so far don't - the limit is set by the manufacturer. For example, on my Nvidia GPU laptop it's either 100W or 60W, but nothing in between for fine tuning.
 
Unlike desktop Nvidia GPUs, which have a user-configurable power limit, mobile Nvidia GPUs so far don't - the limit is set by the manufacturer. For example, on my Nvidia GPU laptop it's either 100W or 60W, but nothing in between for fine tuning.
In this case @JouniS is saying that Nvidia has the ability to raise or lower clocks on the B10 GPU to get to the desired TGP for the DIGITS device (they are the manufacturer) ... and that's true, but raising or lowering the clocks changes the performance for both FLOPs and TOPs, and we know what the PFLOP FP4 rating is for the B10 (sort of), which thus acts as an anchor for how many FLOPs it must have as well. So ultimately the questions are: is it absolutely confirmed that the DIGITS FP4 metric is sparse (likely)? Are AI TOPS for consumer chips INT8 sparse or dense (EDIT: almost certainly sparse based on the 4090)? And what ratio of TOPs to FLOPs does the B10 have - is it like the B100? Or the 5090*?

*Earlier when I said I didn't think Blackwell 1 or 2 mattered, it turns out that was incorrect. I'm not sure why it matters since, as far as I can tell, they are the same tensor cores in the same ratio of tensor cores to CUDA cores, but Nvidia must ensure lower tensor core performance on the consumer chips through lower clocks when the tensor cores are engaged, or some other mechanism, possibly to ensure people don't buy them over the professional ones. After all, they do that with the China-specific 5090D (it has the same core count as the 5090 but 29% lower AI TOPS to meet export controls).

==================

EDIT 2:

Aha here we go:


~250 TFLOPs FP16 (sparse, I'm assuming?) and 512GB/s memory bandwidth. That would put it at ~15 FP32 TFLOPs if it shares the consumer ratio. That's reasonable.

And he brings up the same point I made: it's going after the fledgling Mac mini inference cluster market. However, a comparison to the upcoming 128GB M4 Max Mac Studio isn't awful (it will probably be cheaper, but with a worse CPU and a better GPU for AI - i.e. still a device more focused on GPU AI compute).

EDIT 3: Actually, this guy may not be accurate. Looking around, I thought maybe he was at the event and had more info - he might have been, but maybe not (there seems to be debate over the figures). So it's still up in the air a bit.
 
~250 TFLOPs FP16 (sparse, I'm assuming?) and 512GB/s memory bandwidth. That would put it at ~15 FP32 TFLOPs if it shares the consumer ratio. That's reasonable.

And he brings up the same point I made: it's going after the fledgling Mac mini inference cluster market. However, a comparison to the upcoming 128GB M4 Max Mac Studio isn't awful (it will probably be cheaper, but with a worse CPU and a better GPU for AI - i.e. still a device more focused on GPU AI compute).

EDIT 3: Actually, this guy may not be accurate. Looking around, I thought maybe he was at the event and had more info - he might have been, but maybe not (there seems to be debate over the figures). So it's still up in the air a bit.

If this is accurate, that’s very impressive.

If the GPU is the same as data center Hopper (which is far from obvious), 1 PFLOPS FP4 should translate to 125 TFLOPS FP16 or 250 TOPS INT8 (dense), which would be very impressive. Even if these numbers are 2x off, that's still beyond anything else on the market.

Should Apple want to go after this segment, they won’t get around massively expanding their matrix compute and - what’s even more important - bringing their software up to date.
 
If this is accurate, that’s very impressive.

If the GPU is the same as data center Hopper (which is far from obvious), 1 PFLOPS FP4 should translate to 125 TFLOPS FP16 or 250 TOPS INT8 (dense), which would be very impressive. Even if these numbers are 2x off, that's still beyond anything else on the market.

As far as I can tell, for Nvidia the ratios between tensor FP4-FP16 and sparse/dense are the same on both the professional and consumer GPUs - and it's always a factor of 2: dense to sparse is 2x, and each step down in precision is another 2x.
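As a quick illustration of that ladder, starting from the quoted 1 PFLOP of sparse FP4 and just halving per step - this assumes those 2x rules carry over to the GB10, which isn't confirmed:

```python
# Walk down from "1 PFLOP sparse FP4", halving once for dense and once per
# precision step (FP4 -> FP8 -> FP16). Pure arithmetic, no hardware data.
sparse_fp4_tflops = 1000.0

dense_fp4 = sparse_fp4_tflops / 2   # 500 TFLOPs dense FP4
dense_fp8 = dense_fp4 / 2           # 250 TFLOPs dense FP8 (same rate as INT8)
dense_fp16 = dense_fp8 / 2          # 125 TFLOPs dense FP16

print(f"dense FP4:      {dense_fp4:.0f} TFLOPs")
print(f"dense FP8/INT8: {dense_fp8:.0f} TFLOPs/TOPs")
print(f"dense FP16:     {dense_fp16:.0f} TFLOPs")
```

That reproduces the 125 TFLOPS FP16 / 250 INT8 dense figures quoted above.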

What they like to muck around with is the total tensor output. Let's take Ampere and Blackwell:

Ampere                 | TFLOPs | TOPs (INT8 Sparse) | Ratio
A100                   | 19.5   | 1248               | 64
RTX 3090               | 35.6   | 568                | 15.9
Jetson AGX Orin DevKit | 5.3    | 275                | 51.88

Now Ampere was a special case where the data center GA100 die did not get the extra FP32 unit per SM but did explicitly get greater tensor output. It's on Wikipedia and I believe backed up by the white papers. However, it's fascinating that they were willing to make small GPUs out of GA100s for the Jetson. Or was the method of reducing the GA10X dies' tensor core output something they could remove to make a small data center GPU for the Jetson - i.e. take a GA107 die meant for a mobile 3050 (same core count and tensor core count) and not reduce the tensor core output? Could they do something similar here? Maybe! For the Blackwell dies it seems pretty obvious they have a way to control the tensor core output.

Blackwell         | TFLOPs         | TOPs (INT8 Sparse) | Ratio
B100              | 60             | 7000               | 117
RTX 5090          | 104.8          | 3352               | 31.98
RTX 5090D (China) | 104.8          | 2379               | 22.7
B10               | ? (4.5 - 15.5) | 500                | ?

The B200 and GB200 have a slightly different ratio, probably due to rounding (~112). The really interesting one is that the 5090D is exactly identical to the 5090 in hardware (unlike, apparently, the 4090D), but has its TOPs artificially reduced to meet export controls.
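For what it's worth, the "? (4.5 - 15.5)" in the B10 row is just the implied 500 sparse INT8 TOPs divided by the B100-like and 5090-like ratios - a sketch of that bracketing (the 500 TOPs itself assumes the 1 PFLOP FP4 figure is sparse):

```python
# Bracket the B10's possible FP32 throughput from its implied 500 sparse INT8
# TOPs, using the TOPs-to-TFLOPs ratios observed on other Blackwell parts.
B10_INT8_SPARSE_TOPS = 500  # 1 PFLOP sparse FP4 / 2, an assumption

ratios = {
    "B100-like ratio (~117)": 117,
    "RTX 5090-like ratio (~32)": 31.98,
}
for name, ratio in ratios.items():
    print(f"{name}: ~{B10_INT8_SPARSE_TOPS / ratio:.1f} FP32 TFLOPs")
# B100-like ratio (~117): ~4.3 FP32 TFLOPs
# RTX 5090-like ratio (~32): ~15.6 FP32 TFLOPs
# ... roughly the 4.5 - 15.5 range estimated above.
```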

So the question becomes ... what is the B10? In some ways, truthfully, for this market segment the FP32 TFLOPs don't really matter (and I can't help but think they didn't mention any other GPU performance stats for a reason). It's the TOPs that count (and even more so the bandwidth). And if the GPU is otherwise tiny (and thus relatively cheap to produce) and can't serve any other purpose, then maybe so much the better from Nvidia's perspective? They aren't marketing this as a general desktop and may not want to eat into any other market. For someone like me who wants a CUDA dev machine, a 15.5 FP32 TFLOP GPU with 128GB of RAM in a small enclosure would be awfully tempting. I might not buy a dGPU then - would they want that? Meanwhile, with a bigger TOP-to-FLOP ratio, they could produce a much smaller GPU much more cheaply, get much more profit, and still undercut the competition with better specs that matter for this application.

So I started out thinking it would be M4 Pro class (in FP32) based on the chassis, moved to base M4-class based on the professional graphics cards, moved to M4 Max class based on the consumer cards and am now convincing myself it's smaller again!

The bandwidth issue is also interesting. I saw some Redditors claim he was just using numbers from the current Grace Hopper CPU. But it is a reasonable number given that we know it's 128GB of LPDDR5X memory. Apple certainly uses a 512-bit bus for that much, but AMD's Strix Halo now sports that much unified memory too, on a 256-bit bus, so who knows? I doubt it would be lower than that (or higher) - I'm not sure it can be lower than a 256-bit bus (4x32GB).
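As a sanity check on that 512GB/s figure, peak LPDDR5X bandwidth is just bus width times data rate; the 8000 and 8533 MT/s grades below are my assumptions about what might get used:

```python
# Peak LPDDR5X bandwidth = bus width (bits) / 8 * data rate (MT/s) / 1000, in GB/s.
def peak_bandwidth_gbps(bus_bits: int, mtps: int) -> float:
    return bus_bits / 8 * mtps / 1000

configs = [
    ("256-bit @ 8533 MT/s (M4 Pro-like)", 256, 8533),
    ("512-bit @ 8000 MT/s",               512, 8000),
    ("512-bit @ 8533 MT/s (M4 Max-like)", 512, 8533),
]
for name, bus, rate in configs:
    print(f"{name}: ~{peak_bandwidth_gbps(bus, rate):.0f} GB/s")
# 256-bit @ 8533 MT/s (M4 Pro-like): ~273 GB/s
# 512-bit @ 8000 MT/s: ~512 GB/s
# 512-bit @ 8533 MT/s (M4 Max-like): ~546 GB/s
```

So the quoted 512GB/s would line up with a 512-bit interface at a modest LPDDR5X data rate, if it's accurate.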

Should Apple want to go after this segment, they won’t get around massively expanding their matrix compute and - what’s even more important - bringing their software up to date.

Absolutely agreed. Regardless of how much Nvidia has tailored this product for this specific market to knock Apple out of the single-batch LLM niche they found themselves in, or how powerful for general use the DIGITS desktop ends up being, Apple, if they want to keep this niche or further challenge Nvidia's dominance here, needs better hardware and software. We'll see if the Hidra desktop chip or the M5 amounts to anything special.

I find that very hard to believe. Apple doesn't seem to want to cater much to professionals anymore, especially in such an ultra niche area.
It depends; if this niche was all there was, then I might agree (even then I probably wouldn't). But I would also caution against the general wisdom that "Apple doesn't care about professionals". Sure, the 2010s were pretty abysmal. But with the advent of Apple Silicon, we've seen Apple (slowly) try to turn that around. Apple didn't just add RT cores for gaming. They didn't massively contribute code to Blender or update their long-neglected scientific libraries for fun (still a ways to go). They didn't bump the M4 Max's unified memory to 128GB for normal consumers. Hell, they didn't build the Studio, especially the Ultra, for them either.

While Apple may have accidentally found themselves in this niche and may not have fought for it otherwise, Apple is clearly interested in machine learning and in having class-leading GPUs (no points for guessing which company Apple has been hiring away from for the last 5 years ... it's Nvidia), and they are actively pursuing several major machine learning projects. They may not ever truly cater to the workstation, classic Mac Pro crowd again (though I would dearly love to see a full "Extreme" Apple Silicon chip, just for giggles if nothing else), but this is exactly the non-big-iron "prosumer" that Apple loves. Nvidia is even marketing DIGITS as "for the home researcher". So actually, I would almost be shocked if Apple didn't compete here.
 
I find that very hard to believe. Apple doesn't seem to want to cater much to professionals anymore, especially in such an ultra niche area.

They cater to professionals plenty. You are correct however that this is a fairly niche area, albeit with tremendous growth potential. At any rate, while Digits is a niche product, it spells bad news for Apple's main advantage in the amount of RAM available for professional applications. AMD is also encroaching on this territory. To stay relevant, Apple needs to step up their performance and software game. Of course, they might also choose not to play and stay in the software dev/creative sector, but that would make products like Studio or Mac Pro irrelevant.
 
My last tries at simple machine vision tasks using PyTorch with the MPS backend on the Mac were sadly quite terrible. Since our code mainly runs on Nvidia GPUs, it is not a viable option for us to move to MPX or something like that (whether that works I do not know). I mainly use the Mac as an interface to other computers via different remote desktop solutions, but I have also done a fair amount of ML coding using PyTorch/MPS earlier, and with very, very standard stuff it works conveniently due to the amount of memory available. But sadly, as I said, there is still some basic functionality that is broken.
If I could, I would still prefer to work locally on my 128GB+ Mac and then do longer training sessions on dedicated hardware. Tbh, if PyTorch worked flawlessly, I would instantly buy M2 Ultras instead of the extremely priced Nvidia options.
For context, we work in machine vision, as I said, with the "old" AI/ML that boomed before 2020 - mostly CNNs for object detection tasks and such. So training is very quick on modern hardware (our longest runs take just about 2 days on a single card). But since we work with really high-res stuff, memory is paramount. Project Digits is spot on for us, I think. I really hope that Apple helps fix PyTorch and also releases a Mac Studio or Pro with better perf.
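For anyone else poking at this, here's the kind of minimal device-selection/smoke-test snippet I'd use (assumes a recent PyTorch with the MPS backend; setting PYTORCH_ENABLE_MPS_FALLBACK=1 can paper over unimplemented ops by running them on the CPU):

```python
import torch

def pick_device() -> torch.device:
    """Prefer Apple's MPS backend when available, then CUDA, then CPU."""
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = pick_device()

# Tiny smoke test: ops that are unimplemented or buggy on MPS tend to surface
# here (or silently fall back if PYTORCH_ENABLE_MPS_FALLBACK=1 is set).
x = torch.randn(1, 3, 224, 224, device=device)
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1).to(device)
print(device, conv(x).shape)
```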
 
Minor pedanticism: You keep writing "TFLOPs" and "TOPs" in your (excellent) posts. The "s" needs to be capitalized, as it's part of the acronym, not a pluralization.
Oh ... uh ... yup. Oops. :) I'll try not to make that mistake in the future. Thanks!
 
Any rumors/truth to this?


M1-M5 is one (first) generation, ending with 3nm fabrication.
M6-M10 is the second generation, starting with 2nm EUV lithography.

Everything in between is an iteration of the former where Asahi Linux clarifies that even further.
 