So, there's a tip we really already knew - don't try using a 20GB LLM when you only have 24GB of RAM (unless you have an awful lot of time to waste waiting for it!).
Do you think an M4 Mac (10/10) with 32GB of RAM would be better for 20GB models like your example, than an M4 Pro (12/16) with 24GB with higher bandwidth? I’m still really struggling with that decision…
 
This is all new to me, as I’m just experimenting.

I can run gemma2:27b, which is 16GB, on my MBP with 24GB. It can sometimes take a little time to come up with a response, but it’s generally quite usable.

However, llava:34b, which is 20GB, is totally unusable.

My take from this is that 32GB RAM *should* run the 20GB model roughly the same as the 16GB model runs on my 24GB machine.

I’m of the mind that it’s mostly all about the RAM. It’s possible (probable?) that the Mac can’t give all of its RAM to the model, hence the 20GB model is too much despite having 24GB RAM. Theoretically, the 20GB model in a 32GB Mac should be okay. I don’t believe the bandwidth is the deciding factor - the RAM is. Likewise, a Mac with less RAM will probably do okay with the smaller models (there are plenty to play with). I’m definitely no expert here, though.
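If it helps anyone doing the same sums, my rough rule of thumb (and it is only a rule of thumb I've pieced together, not an official formula) is parameters × bits-per-weight ÷ 8, plus a couple of GB for context and overhead:

```python
def estimate_model_gb(params_billion: float, bits_per_weight: float = 4.5,
                      overhead_gb: float = 2.0) -> float:
    """Very rough estimate of the RAM a quantized model wants (rule of thumb, not exact)."""
    weights_gb = params_billion * bits_per_weight / 8   # billions of params x bits / 8 = GB of weights
    return weights_gb + overhead_gb

# ~4.5 bits/weight is my guess for a typical Q4-style quantization
print(round(estimate_model_gb(27), 1))   # ~17 - in the ballpark of the 16GB gemma2:27b download
print(round(estimate_model_gb(34), 1))   # ~21 - close to the 20GB llava:34b that chokes on 24GB
```

Once you remember that macOS won't hand all of that RAM to the GPU, the numbers line up with what I'm seeing.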

That all said, if I was to look seriously (or even semi-seriously) at LLMs or AI image creation, I’d also be looking at other things in addition to the RAM.

When the LLM is thinking (or when AI image generation is working), the M4 Pro chip is really put to work. Temperature goes up, the fans kick in, and my MacBook gets quite warm. It can also hammer the battery if you’re not plugged in (I was playing about with it before work this morning, and the battery went from 100% to the 20% warning in about two hours).

So, if I was to look at a machine to do this regularly, I’d probably look at something like the Studio with plenty of RAM - but that’s going to be real expensive, so I’d really need to *want* to do that. Maybe I’ll save up to get one in 2027. :)

For now, I’m content that my MBP can do what it can do (I love experimenting and pushing its limits) - but I won’t be doing much of it on battery power.
 
Okay, I think I’ll try to experiment with my current M4 Pro with 24GB of RAM and when I “downgrade” to the M4 with 32GB of RAM I’ll run the same tests, and compare.
 
Just for my own amusement, I thought I would fire up my old Mac mini (late-2014) to see if it could run an LLM.

With just the basic install of Ollama (and Docker for the Web-UI), running the default llama3.2:latest (3.2B), it does actually work.

My old Mac mini runs Monterey on a 2.8GHz i5, 16GB 1600MHz RAM, Intel Iris 1536MB graphics, and a 1.12TB Fusion Drive (part of which is boot camped off for Windows).

My primary reason for doing this is that I'm considering a base £599 M4 Mac mini as a backup/desktop/second computer, and I want to see (i) if it would be capable of running an LLM and, if so, of what size; and (ii) if I would see a big performance drop from my MacBook Pro M4P. I figured that, if the 10-year-old machine could at least run something, then I'd be fairly good to go for a new M4 base Mac mini.

The good news is that it does, indeed, run an LLM. The amusing / intriguing thing is just how it runs it. On my M4P MBP, anything to do with LLMs runs on the GPU, and you can see the GPU max out while it's working. The CPUs are barely hit (just the efficiency cores, probably handling whatever else the Mac is doing). It runs reasonably fast even if I have a local Stable Diffusion running in the background, despite using a good amount of swap - which is probably helped by the memory bandwidth.

The 10-year-old machine is a different story. According to Activity Monitor, the GPU (the Intel Iris 1536MB) isn't even tickled, never mind used. Everything runs on the CPU's four cores. These are typically around 40-50% used when not doing a lot, so the LLM maxes them out - requiring some waiting time. I asked both machines the same question ("What's the largest LLM I can run on a Mac mini (late 2014) with 16GB RAM?" - to which I got slightly different answers). The older machine took almost three minutes to answer, while the MBP took 18 seconds. So, it's usable if you're nursing your cup of coffee while waiting for it.
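If anyone wants to repeat this kind of timing test, Ollama exposes a local HTTP API (it listens on port 11434 by default), so a small script can ask the same question on each machine and report the elapsed time. A minimal sketch, assuming the default llama3.2 model is already pulled and the Python requests package is installed:

```python
import time
import requests

prompt = "What's the largest LLM I can run on a Mac mini (late 2014) with 16GB RAM?"

start = time.time()
# Ollama's local API listens on port 11434 by default
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.2:latest", "prompt": prompt, "stream": False},
    timeout=600,
)
resp.raise_for_status()
print(resp.json()["response"])
print(f"Answered in {time.time() - start:.0f} seconds")
```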

Clearly it's not a practical exercise to run an LLM on such an old machine, but I'm reassured that, if that can do it, so can a modern base M4 Mac mini (if I was to buy one, I would *not* be increasing any of its specs from base). 16GB RAM will limit which LLMs I can run, and the slower memory bandwidth will, I imagine, slow things down a little, but I feel it'll be usable as a second machine.
 
I’ve been playing around with Mixtral using the DuckDuckGo AI chat. It’s interesting and somewhat helpful for parts of my workflow.

I’m tempted to run this locally using LM Studio, but I’m not sure it’s worth the effort. For those who run these LLMs locally, what is your use case and why do you do it?
 
My “use case” is brainstorming my novels.

Many years ago, I wrote a series of stories (for hobby purposes) and I kind of lost track of them over the last few years. I’m just getting back into them now, and I think LLMs could help me edit / improve them. Also, there’s a lot of information in there (the stories are linked) which, now, I have trouble keeping track of.

The main issue with LLMs is their limited memory. You start off by brainstorming chapter one and, by the time you’re on chapter eight, it’s forgotten chapter one - which is very difficult when you establish a plot point that comes back later and the LLM thinks it’s something brand new.

On that issue, I’ve been looking at ways of getting around the limited memory problem. Today I looked at implementing RAG (retrieval-augmented generation - basically having documents pre-loaded which the LLM can draw on to supplement the model you’re using, without having to retrain the whole thing on new information). I did try just adding an attachment to a prompt, but the LLM got royally confused about different elements of the text. The RAG method, though not infallible, looks like it might work. After setting it up today, I was able to ask the LLM questions about the story and it was pretty reliable in its answers - although it’s still not completely accurate, so I’m going to look at ways to organise the information more concisely.
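For the curious, the general shape of what I set up looks something like the sketch below. This is just my understanding of the idea, not what the tool does internally; the embedding model, file name and question are made up for illustration, and the embedding model would need to be pulled separately in Ollama.

```python
import requests

OLLAMA = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"   # embedding model I chose for the sketch; pull it first in Ollama
CHAT_MODEL = "gemma2:latest"       # whichever chat model you normally use

def embed(text: str) -> list[float]:
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": EMBED_MODEL, "prompt": text})
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

# "Pre-load" the story notes: split them into chunks and embed each chunk once.
chunks = open("story_notes.txt").read().split("\n\n")        # hypothetical notes file
index = [(chunk, embed(chunk)) for chunk in chunks]

def ask(question: str, top_k: int = 3) -> str:
    # Find the chunks most similar to the question and hand them to the model as context.
    q_vec = embed(question)
    best = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:top_k]
    context = "\n\n".join(chunk for chunk, _ in best)
    prompt = f"Use these notes to answer the question.\n\nNotes:\n{context}\n\nQuestion: {question}"
    r = requests.post(f"{OLLAMA}/api/generate",
                      json={"model": CHAT_MODEL, "prompt": prompt, "stream": False})
    return r.json()["response"]

print(ask("Where did the sisters first meet?"))   # made-up example question
```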

I also have Stable Diffusion set up so that, when discussing a scene, I can hit the “image” button and (after a while) I’ll have some kind of visual representation of a scene or a key moment in a scene. That’s not 100% accurate either (and it can take some time to generate), but it’s something I’m interested in exploring.
 
Forgive a beginner's question, how specifically are you integrating RAG on a Mac? I thought you couldn't do that with LM Studio.
 
I’m using Ollama. There’s a section called “knowledge” into which you upload a document or collection of documents. You then create a new “model” based on a pre-existing model to which you attach this collection.

When you then query your new model, it will first look into the documents of that collection for an answer and, if it doesn’t find one, it’ll default back to the model it’s based on just like before.

It kind of works, but I’m still learning how to best optimise my documents to make them most effective.
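My mental model of that "look in the documents first, otherwise carry on as normal" step is roughly a similarity threshold - this is just my guess at the general idea, not how the tool actually implements it, and the 0.5 cut-off is arbitrary:

```python
import requests

OLLAMA = "http://localhost:11434"

def embed(text: str) -> list[float]:
    # assumes an embedding model such as nomic-embed-text has already been pulled
    r = requests.post(f"{OLLAMA}/api/embeddings",
                      json={"model": "nomic-embed-text", "prompt": text})
    return r.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

def build_prompt(question: str, doc_chunks: list[str], threshold: float = 0.5) -> str:
    q_vec = embed(question)
    best_score, best_chunk = max((cosine(q_vec, embed(c)), c) for c in doc_chunks)
    if best_score >= threshold:
        # Relevant chunk found: ground the answer in the uploaded documents
        return f"Answer using these notes:\n{best_chunk}\n\nQuestion: {question}"
    # Nothing relevant: fall back to the base model as if no documents were attached
    return question
```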
 
@VitoBotta @mainemini or anyone playing with local fine-tuning like LoRA

Have you had any luck with VitoBotta's Mac mini spec ("14 CPU cores, 20 GPU cores, 64GB of RAM, and 2TB of storage"), or with anything similar?
 
I posted previously about having a bash setting up Ollama on my old Mac Mini (late-2014) and that it can work, albeit slowly.

On a similar note, now that I have a MacBook Pro, I don’t use my MS Surface Pro 6 any more. Today I did a wipe and clean install of Windows 11 on it, and then installed LM Studio (I was curious as to how different LM Studio is from Ollama). I didn’t know how well it would run an LLM, but it’s currently running the small 1B Llama-3.2 quite well. It takes a moment to analyse your prompt, but it answers in a timely fashion and isn’t too slow. It’s definitely usable.

That machine only has an Intel 620 for GPU (I think it can share up to 8GB of the machine’s 16GB total RAM, but I’ve never really looked into it), but I’m quite impressed to discover that LLMs can run on what are quite humble machines by today’s standards.
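One thing I noticed while poking around: LM Studio can also run a local server that speaks an OpenAI-style chat API (port 1234 was the default on my install - worth checking the server settings if yours differs), so you can script against it much as you would with Ollama. A minimal sketch, with the model name being whatever the server reports as loaded:

```python
import requests

resp = requests.post(
    "http://localhost:1234/v1/chat/completions",   # LM Studio's local server (default port on my setup)
    json={
        "model": "llama-3.2-1b-instruct",          # use whichever model name the server lists as loaded
        "messages": [{"role": "user", "content": "In one paragraph, what is a 1B-parameter model good for?"}],
        "temperature": 0.7,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```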
 
I run the Llama 3.3 70B Q4_K_M (4-bit quantized, K-means, medium, balanced quality) parameter model (knowledge cutoff date of Dec 2023) on my M1 Max (64GB). It takes 43GB, but is 100% GPU! So, it's reasonable in speed. And Llama 3.3 70B is roughly equivalent to Llama 3.1 405B. It's amazing that 70B fits. For example, here it is running on yesterday's news about Trump and Fauci. GPU is pegged at 100%. Instructions on how to run it are in there; install Ollama first from https://ollama.com/. You can see the token generation rate (around 6.5 tokens/sec).
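If you want the exact figure rather than eyeballing it, Ollama's API reports token counts and timings in its response, so a few lines of Python will print the rate (the model tag below is approximate - check `ollama list` for whatever you actually pulled):

```python
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3:70b-instruct-q4_K_M",   # check `ollama list` for the exact tag
        "prompt": "Give me a three-sentence summary of how quantization affects LLM quality.",
        "stream": False,
    },
    timeout=1800,
)
data = r.json()
# eval_count = tokens generated, eval_duration = generation time in nanoseconds
print(f"{data['eval_count'] / (data['eval_duration'] / 1e9):.1f} tokens/sec")
```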
 
Forgive a beginner's question, how specifically are you integrating RAG on a Mac? I thought you couldn't do that with LM Studio.
I just did this in LM Studio with Llama 3.2. You just attach a file within your chat and you can interact with the document.
 
I've been trying out other LLM tools - so far that's been Ollama, LM Studio, and Jan. I'd like to use LM Studio and Jan more, but the more I ask them, the more gibberish they come out with. Maybe there's a setting to change that, but I don't get that from Ollama. More experimentation is clearly in order. :)
 
LM Studio says certain models are too large, even though I know they run fine in Ollama at 100% GPU. For example, the Q4_K_M below.
View attachment 2476153
I've noticed that, which is why I’ve stayed small (below 5GB) on LM Studio and Jan. I don’t want to run anything too big because I’m also doing some image generation at the same time. Both together really push my MBP. I can see myself with a Studio in the near future!
 
Try Ollama. It seems to allow me to run bigger models on my M1 Max, so I use that. LM Studio doesn't even let me load a 43GB model, which is under the ~75% of unified memory that macOS makes available to the GPU on my 64GB machine (75% of 64GB is 48GB).
View attachment 2476171
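For anyone wanting to sanity-check other configurations, the same arithmetic in code (the 75% figure is the commonly quoted default for how much unified memory the GPU is allowed to use - I haven't confirmed it in Apple's documentation):

```python
def fits_on_gpu(model_gb: float, total_ram_gb: float, gpu_fraction: float = 0.75) -> bool:
    """Rough check: does the model fit in the slice of unified memory the GPU can use?"""
    return model_gb <= total_ram_gb * gpu_fraction

print(fits_on_gpu(43, 64))   # True  - a 43GB model vs ~48GB usable on a 64GB machine
print(fits_on_gpu(20, 24))   # False - matches the 20GB model struggling on a 24GB Mac
print(fits_on_gpu(20, 32))   # True  - and why 32GB should handle it
```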
I use Ollama most of the time. I’m just experimenting with what alternatives are out there.

Right now I’m using gemma2:latest (9.2B) with a couple of documents in Ollama's Knowledge to brainstorm some scenes in my current story that I’m writing. I’ve also been doing the same with the online version of ChatGPT to have some comparison.

My MBP has 24GB RAM and I have run gemma2:27b through Ollama before, but it’s too slow to be realistically practical for any real-world purpose, so I’ve dropped down to a more usable one for regular use.

I’ve used "Gemma The Writer N Restless Quill 10B” on LM Studio and "Gemma-The-Writer-N-Restless-Quill-V2-10B” on Jan. After a few prompts and some analysis of my scenes, the responses become less and less legible. They’re fine for just asking a few questions rather than having a long conversation, though.
 
Small models are terrible and a waste of time, IMHO. I have another MacBook, an M3 with 24GB; it's hopeless for running a local LLM, so I stopped trying.
The difference is the RAM size. More specifically, how much memory can be dedicated to the GPU. The unified memory model is a big win for Apple Silicon in this respect.

The 64GB M1 Max is amazing with a large model like the 70B Q4_K. I'm going to see if the bigger Q6 variant fits; it should be even better.
 
I didn’t buy the MBP with LLM in mind. I only started exploring them when I discovered the MBP could handle them.

I don’t find the smaller models a hindrance for the purpose I use them for, but I’m certainly looking to the future. If I decide to continue down this road, I’m considering a Studio later this year - something with at least 96GB RAM. I need to be sure it’s something I want to do, though, because it’ll cut into my finances pretty heavily.
 
Small models are terrible and a waste of time IMHO.

I’m new to this. I’ve used 7B to 16B models so far. I believe 70B will be too large for my 48GB M4 Pro.

These medium-sized models seem pretty good to me so far. But, I’m not doing anything very complicated.

Do you have some specific examples of why the smaller models are “terrible and a waste of time?”
 
My YouTube video linked above shows good behavior when it comes to truth on Llama 3.3 70B Q4_K_M.

However, in my informal testing, the smaller models seem to go off the rails easier. And even within the same family of models, e.g. 70B, there are significant differences depending on quantization level, e.g. Q2-Q6.

Even the 70B Q4_K_M model (to my surprise) hallucinated badly on one set of examples from my model rail hobby, rather than spit out correct information. (I suspect it has rather more data on Donald Trump and Dr. Fauci than on Z gauge, and so it does better on the former topic.)

That's why I'm downloading the Q6_K version. And if that runs, I want to see if it exhibits the same behavior. Then compare it with the full fp version (I guess that one is available online only).
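For what it's worth, a quick back-of-the-envelope check while it downloads (Q6_K is roughly 6.5-6.6 bits per weight as far as I understand, and the ~75% GPU share is just the commonly quoted default, so take this with a pinch of salt):

```python
weights_gb = 70e9 * 6.6 / 8 / 1e9   # ~57.8 GB for the weights alone at ~6.6 bits/weight
gpu_budget_gb = 64 * 0.75           # ~48 GB usable by the GPU under the usual default
print(weights_gb, gpu_budget_gb, weights_gb <= gpu_budget_gb)   # 57.75 48.0 False
```

So it may well be tight - I'll see when the download finishes.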
 
My main issue with LLMs is their inability to remember early parts of a discussion. Even with a document attached, it often doesn’t find the information I need.

I’ve tried drafting scenes with it and while, on the whole, it comes up with a good starting point, it’ll often get a name wrong, or a person’s job wrong, or what they look like wrong, even though these things have been discussed previously.

Do larger models have longer memories? If that’s a feature of a larger model, then it will go some way towards convincing me of a future purchase.

But, generally, despite using similar-sized or smaller models, I don’t find the models on Ollama spout gibberish the way they do after a while on LM Studio or Jan.
 