M4 Max Studio 128GB - LLM testing

richinaus · May 23, 2026

Populus said:
Yeah, but that’s now, with cloud LLMs heavily subsidized. Wait for them to charge you the real cost, and we’ll see if it’s sustainable.

In the meantime what’s the alternative except either investing massively in new hardware, or use the cloud now which is much faster than my 4090 and change later if needed?

Populus · May 23, 2026

richinaus said:
In the meantime what’s the alternative except either investing massively in new hardware, or use the cloud now which is much faster than my 4090 and change later if needed?

In the meantime, indeed, it’s a good idea to take advantage of those heavily subsidized services… but keeping in mind that those aren’t the real costs, so that we can build an alternative as soon as they start to rise in price.

I mean, I’m using Claude right now, but I keep exploring local LLMs to make the switch whenever I need it.

Hopefully Apple will invest more in local AI for future Apple Intelligence tools.

xxray · May 23, 2026

novagamer said:
For anyone reading this lately and wanting to run the newer models, Gemma 4 31B is pretty good but will go to 80-90gb easily, and that’s before long context. Get as much memory as possible if you want to use these models locally. Hopefully the next generation of laptops will finally go above 128gb, but we have a year wait probably.]

M5 Ultra can’t come soon enough.

What are you using it for? I've been debating between 64GB and 128GB when the M5 Mac Studio comes out and have been leaning towards 64GB because I just don't know if I can stomach an additional $800 (the price of a phone, tablet, or laptop) for just more RAM.

I've been able to get more use out of 16GB of RAM on my M1 Pro MacBook Pro for local LLMs than I was expecting.

novagamer · May 23, 2026

xxray said:
What are you using it for? I've been debating between 64GB and 128GB when the M5 Mac Studio comes out and have been leaning towards 64GB because I just don't know if I can stomach an additional $800 (the price of a phone, tablet, or laptop) for just more RAM.

I've been able to get more use out of 16GB of RAM on my M1 Pro MacBook Pro for local LLMs than I was expecting.

Research work; if you're happy with what you're getting out of 16gb by all means save the money, but Gemma-4 31B empirically does use 80-90gb even without long context, so I think something like 192gb would be necessary if you used it at long context plus other workloads at the same time.

I'm doing some very specific math heavy things that I can't get into so I am constantly hitting RAM ceilings – earlier tonight I hit OOM around 142gb and that was not even with a large model.

There is a danger in overbuying if you don't really need it, quantized models can be decent but not for the work I'm doing I have to run a lot of stuff at FP64 which is a whole 'nother can of worms. I'm kind of within a niche of a niche right now, but hopefully not forever.

dblissmn · May 25, 2026

How heavily subsidized does everyone think the cloud LLMs actually are? I'd heard something like OpenAI costing $1.25 for every dollar they actually charge; reaching historically normal business margins would presumably call for a price increase of about a third and reaching the kind of margins tech companies seem to like to go for these days might call for something closer to double the price at that rate.

dblissmn · May 25, 2026

novagamer said:
Research work; if you're happy with what you're getting out of 16gb by all means save the money, but Gemma-4 31B empirically does use 80-90gb even without long context, so I think something like 192gb would be necessary if you used it at long context plus other workloads at the same time.

I'm doing some very specific math heavy things that I can't get into so I am constantly hitting RAM ceilings – earlier tonight I hit OOM around 142gb and that was not even with a large model.

There is a danger in overbuying if you don't really need it, quantized models can be decent but not for the work I'm doing I have to run a lot of stuff at FP64 which is a whole 'nother can of worms. I'm kind of within a niche of a niche right now, but hopefully not forever.

Nonetheless, seeing as there's literally nothing out there available above 128GB, is there a good use case for 128GB, or is it sort of a "tweener" configuration that leaves you better off just sucking it up on a smaller memory if you have to buy now and offload everything else to the cloud?

novagamer · May 25, 2026

dblissmn said:
Nonetheless, seeing as there's literally nothing out there available above 128GB, is there a good use case for 128GB, or is it sort of a "tweener" configuration that leaves you better off just sucking it up on a smaller memory if you have to buy now and offload everything else to the cloud?

I’m already hitting the limits so I can’t really do more, if I upgraded it would be for the ~30% speed up and 3x-5x inference time to first token speedup vs. M4.

There ae some experiments I want to do with quantum ‘agents’ which need enormous amounts of memory to approximate with python, I think 3 of them will barely fit within 512GB so I’m really, really hoping that config comes back. My work is not really a normal “I’m just writing some code” use case, I do sometimes use AI to orchestrate local runs – I do not offload my source code to the cloud, it all stays on disk and runs locally. My issue is that a lot of the primary work I’m doing eats ~80-150gb if I let it, which means that to run local AI alongside that is …difficult.

I think 128GB is a fine config for most power users, given how much memory costs I think Apple’s prices right now are a bargain honestly. I expect the studio will raise them, sadly.

For “normal” use a quantized Gemma-4 or a lower parameter count will be fine, it’s a really good model. I did some HLE testing and it solves some of them out of the box, albeit at ~15-45 minutes each on the M4 Max. Also the faster disk on M5 is nice, there is some newer work that streams larger models in via a sort of sliding window mechanism, you can google for more on that. People are getting good results with the newest Qwen also but I haven’t tried that yet.

TL;DR I think if it’s a computer you’re going to keep for a long time I think buying the most you can is worthwhile, if you are a power user. If you’re just curious and don’t have a use case, probably stick with less.

dblissmn · May 25, 2026

novagamer said:
I’m already hitting the limits so I can’t really do more, if I upgraded it would be for the ~30% speed up and 3x-5x inference time to first token speedup vs. M4.

There is some work I want to do with quantum ‘agents’ which need enormous amounts of memory to approximate with python, I think 3 of them will barely fit within 512GB which is the minimum I need for one of my projects so I’m really, really hoping that config comes back. I’m doing work with things like diffusion and complex graphs and lots of things like that so it’s not really a normal use case, I also sometimes use AI agents to orchestrate local runs – I do not offload my source code to the cloud at all, it all stays on disk.

I think 128GB is a fine config for most people, given how much memory costs I think Apple’s prices right now are a bargain honestly. I expect the studio will raise them, sadly.

For “normal” use a quantized Gemma-4 or a lower parameter count will be fine, it’s a really good model. I did some HLE testing and it solves some of them out of the box, albeit at ~15-45 minutes each on the M4 Max. Also the faster disk on M5 is nice, there is some newer work that streams larger models in via a sort of sliding window mechanism, you can google for more on that.

TL;DR I think if it’s a computer you’re going to keep for a long time I think buying the most you can is worthwhile, if you are a power user. If you’re just curious and don’t have a use case, probably stick with less.

This is helpful, thanks. I have a MBP on order with 128, the key thing being I really want to do local AI on large PDF archives and book manuscripts and generally raise my game on research. This in addition to multitasking tendencies and the usual collection of photo/video etc.

novagamer · May 25, 2026

dblissmn said:
This is helpful, thanks. I have a MBP on order with 128, the key thing being I really want to do local AI on large PDF archives and book manuscripts and generally raise my game on research. This in addition to multitasking tendencies and the usual collection of photo/video etc.

Don’t sleep on GPT Pro if you can afford it and are OK with the possibility of research being leaked or reviewed by humans. If you aren’t doing critical IP / patent-level work it’s very worth the $200/mo, if you are the cost is eye watering ($160 per million output tokens for 5.5 Pro). Obviously if you use any paid model turn off training on your data, which you can for Claude and GPT but not Gemini unless you use a private chat which does not persist state.

For the consumer version use the web and “extended research” and occasionally combine it with ‘/deep research’ to get the breadth. I don’t think any local model will come close to being able to synthesize like GPT Pro can for probably a year, but I hope I’m wrong. I like Claude much better for most things but for breadth and depth pulling in 600 documents GPT Pro is extremely good.

I used AI to translate some estoeric research work from a foreign language for me and used that to synthesize some other stuff, it’s pretty great the tools we have access to despite the obvious societal downsides.

Local models will do just fine with converting PDFs and managing things, they are pretty capable now, and some can even wire into harnesses similar to Claude Code, but I haven’t seen good demonstrations for the web search abilities like I have with the frontier labs; they simply have more reach and speed right now – although I haven’t done a ton of experimenting with local models for this purpose.

wegster · Jun 4, 2026

Fun thread. 🙂
I've been running Qwen-Coder-Next 80B on my sort-of aging M1 Max MBP 64GB. A whole lot of playing around and a few bake-offs, with the intent of using via CLI and for vscode integration via Continue. I'm not 100% dialed in yet, but am running 64k context, and am pretty much at the bleeding edge - as someone else mentioned, I certainly had hit memory pressure red to the point of shutdown during some fine-tuning.

I'm running via locally configured and compiled llama.cpp as a launch service, serving local and for a few Linux dev VMs. Exact model: unsloth_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M.gguf
llama.cpp compiled with Metal support, etc.

Started down this path after being burned badly with Bolt (*super short* context it seems), and have worked in AI/ML for a handful of years (although not in LLM/generative space), so wanted to see where I could get to. This led to a long but useful rabbit hole of building my own agentic AI guardrail system, and reduced credit/token use by over 90% (no, it's not 'magic prompts' 😉 ), but also went looking at how far I could go locally.

Some interesting bits - I did compare vs Devstral/Mistral variants and a few others. QwenCoder-Next is an MOE model with selective hyperparam activation. The difference in response/tokens per second as well as time to response was a HUGE difference with Qwen3-Coder-Next coming out on top by quite a bit. The others simply weren't usable without screaming. Yes, I could have dropped down on the params and looked for a 30B (I only use 1.5B for code completion), but this is where I landed <right now> after a good amount of testing including code tests and reviews.

I probably could reduce my context size a bit, and increase max_tokens and prediction sizes.

In general, I use 'web ai' for planning discussions, finalize into a 'spec' I dump in the repo, then let either Codex or local model implement, or I implement part, then have it refer to the spec to complete and test. Sometimes I'll round robin and have web AI give me the local agent prompt and feedback ack/forth while jumping in if/when I see things going funny/wrong, etc.

Continue was kind of a huge pain to get sorted, so was vscode honestly. I'm a former sr sw engineer/architect, but moved into leadership roles some time back, so part is fun, but part is also me being a bit rusty, and I never preferred MS tools for well - anything at all, as a long-time Unix/Linux engineer. Did a few frustrating rounds of trying o get proper permissions enabled in Continue, while vscode config is literally *polluted* IMO with copilot settings, which was like a frigging virus - every time I thought I had banished it forever from my vscode, it found a way to pop back up like a virus. It didn't help with so MANY settings in vscode that sure sounded generic but were literally only for copilot, but got there, and will never touch/enable copilot in vscode again/it's banished forever for me. 😀

I also in general, refuse to shift to pay-per-token/API models at this point, as it's quite easy to see things jump pretty significantly. Of course, inevitable as someone said, pricing WILL go up, but if you're using paid models/plans, you do NOT need flagship Opus or GPT for 90% of most tasks. A buddy was burning around $1k/month, and a prior very senior sw engineering team I had managed was routinely running out of tokens. While some of what they're doing is pretty complex work, 90% of it could be done on Haiku, 5.4-mini or similar, and the price multipliers are HUGE on the flagship models. Right now, I use a paid GPT account for web AI, I keep a small paid plan on Bolt (always use your own project github, private is fine) for claude/haiku/sonnet/opus), and swap between codex (mostly 5.4-mini medium) and Qwen for coding tasks.

I know we're mostly talking about local, but ask your AI of choice or see if you can actually find a rate page for your brand model of choice - the difference in flagship vs 'pretty damned good' can be pretty crazy, even approaching 10x or more. Until OpenAI hosed me a bit (they seem to have dropped non-Spark 5.3-codex from the list in Codex app), I'd do 5.4-mini or Qwen, only occasionally bumping up to 5.3-codex. 5.4-mini is around 15% of GPT5, and still like 25% of 5.4 non-mini, while Codex is a bit less that 1/2 of the flagship, and slightly less than 5.4 non-mini. (not spark - they don't even tell you what it costs, which is hokey). So yeah, if you're going to use paid, consider which specific model to use when and your bank account and sanity will thank you.

Qwen3-coder-next generally performed at least as well as 5.4-mini and approaching 5.3-codex, occasionally but rarely slightly better, most of the times not-quite-as good. Single file/simple edits/patterns zero issues, but when you get into more complex tasks as in repo-wide refactors, this is where you want a higher-powered model or a detailed execution plan, possibly worked out with a web ai model with stronger planning/reasoning vs coding-focused agentic models.

So yeah, I now use Qwen and Codex almost interchangeably in practice, was doing Qwen for 90% of the work, but as I put my agent framework/guardrails in place and watched my credit usage plummet, it's a flip of the coin as to which I use. I probably need one more round of tuning on Qwen, but it's not locking up the system any more (yay 😉 ), although Codex remains faster vs the M1 Max.

Next step most likely is an M5 Studio 256GB if/when they announce it. I'll offload model use there, and keep the M1 Max MBP for a while longer as my travel system, and see what the 'mbp ultra' actually turns out to be. The M1 Max MBP is still an overall competent system - right up until I wanted to push the bleeding edge on RAM use with it. 😀

BeatCrazy · Jul 1, 2026

JSRinUK said:
I stepped away from local LLM for a little while. Once you’ve tested the big models, you need to have a real use for them to use them in earnest. For me it’s mostly about brainstorming my stories and, once you have your preferred model, you’re kind of already there.

I did do some “coding tests” a little while back in which I pitted frontier AI against large LLMs on my Mac. The results were not much to write about. The local LLMs pretty much failed. They would generate code, but the code would then generally fail. They create the functions you need for your code to work, but then don’t call them - because they’ve forgotten about them by the time it needs to use them. They may have long context, but they have short memory.

Even frontier AI didn’t fare much better. They’re generally so quick to go down the “popular path” of coding, that they trip right up when you tell them “I have a Mac, not CUDA-core stuff”.

You could work with the frontier AI, but you’d find yourself constantly returning to it with errors - often silly things that, if you already know a bit about coding, you can fix yourself. Sometimes it felt like I was teaching the AI - at which point, I’d call the test a failure.

The only one that showed any sign of help (despite not being completely flawless) was Claude Sonnet - I didn’t have Opus at the time (I do now).

Instead, I recently shifted my focus to image generation. As with local LLM, my primary goal is that everything should remain local - no “calling out to the web” once the models are downloaded. I had a poor experience with the “node hell” that was ComfyUI several months ago, so I’m avoiding that for now.

Instead, I’ve been leveraging https://pypi.org/project/mflux/ which is an MLX port of several generative image models.

Mflux supports the following models:

Z-Image Turbo,

Flux.1 (including variants),

FIBO,

SeedVR2

Qwen-Image (including Qwen-Image-Edit).

I see today that they’ve added Flux.2 recently (so I’m going to put that on my “to do list”).

With the help of Claude doing (most of) the coding, I’ve assembled an “image generation toolkit”, which consists of the following python “apps”:

Prompt Workshop

This sends your simple prompt (eg. “a helicopter hovering over a hot dog stand”) and leverages a local LLM to enhance it into something more elaborate, to assist the Image Generation model. I’m using a VLM so that I can also include an image, and the “enhanced prompt” will use that as a reference (colours, style, cinematography, etc)

A second function of this app is simply to have a local LLM describe what it sees in an image I provide. So, if there's an image I like and I want to generate it elsewhere with some modifications, this will assist me with that.

Multi-Model Image Generator

The intention here is to take my image generation prompt and generate the image. I can select the size I want, plus alter other parameters, and whatever quantity of images I want (with the seed varying).

I can also select from up to 6 different image generation models (all the ones mflux supports, but not Flux.2 yet), so that I can select my favourite.

Multi-Model Image Editor

Here, I provide one or two images and describe what I want to change in the image.

I can select from up to 2 different image editing models. If the chosen model can accept both images, it'll use them. Otherwise it'll work on just the first image.

Image Upscaler

This does what it says on the tin. I provide an image, pick a size or scale factor, and I'm provided with an upscaled version of the image.

Background Remover

A new addition to the toolbox. I provide an image, and within a few seconds, it creates the same image with the background removed (or with a plain black, or plain white background). Optionally, the mask is also generated (which could be useful if I wanted to edit in a graphics app).

Just today, I got Claude to combine each of these scripts into a single “one click and forget” script. You put in your simple prompt, and it automatically gets the enhanced prompt, sends it to the image generator, upscales the image, and removes the background (with all images being saved at every stage).

It’s amazing what we can do on a desktop computer these days.

@JSRinUK thanks for starting and updating this thread with your journey.

Although I'm a current M1 Max 32GB Studio user with some Ollama stuff, I started getting interested in image/text -> video generation that I started with ComfyUI via Windows/NVIDIA. Although I got it working,I'm always close to being 'in over my head' and risk of breaking stuff if I push it too far.

So, ideally I'd move this workflow over to macOS and I just secured a M4 Max 128GB via Apple Refurbished site.

Have you tried any image/text to video generation? Anything out there today what also provides sounds/voice plugins?

Search

Search

M4 Max Studio 128GB - LLM testing

richinaus

macrumors 68030

Populus

macrumors 604

xxray

macrumors 68040

novagamer

macrumors 6502a

dblissmn

macrumors 6502

dblissmn

macrumors 6502

novagamer

macrumors 6502a

dblissmn

macrumors 6502

novagamer

macrumors 6502a

wegster

macrumors 6502a

BeatCrazy

macrumors 603

Our Staff