
VitoBotta

I got the mini with 14 CPU cores, 20 GPU cores, 64 GB of RAM, and 2TB of storage. I'm really glad I didn't go for a model with less memory because I wouldn't have been able to run large language models locally.

Now I can use LM Studio to run both the standard and the coder versions of Qwen2.5 32B. The inference speed is pretty good, around 11-12 tokens per second, which works well for real tasks. It's also great that I can keep both models in memory all the time, so there's no loading delay before each response.
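
If it's useful to anyone, LM Studio can also expose the loaded models through a local OpenAI-compatible server (port 1234 by default), so you can hit them from scripts and other tools, not just the built-in chat. A minimal sketch in Python; the model name is just a placeholder for whatever you have loaded:

```python
import requests

# LM Studio's local server speaks the OpenAI chat-completions API.
# Port 1234 is the default; adjust if you changed it in the server settings.
resp = requests.post(
    "http://localhost:1234/v1/chat/completions",
    json={
        "model": "qwen2.5-coder-32b-instruct",  # placeholder: use the identifier LM Studio shows
        "messages": [
            {"role": "user", "content": "Write a one-line docstring for a binary search function."}
        ],
        "temperature": 0.2,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])
```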

I use LLMs a lot, and I now prefer local models for general tasks and coding; I even switched away from GitHub Copilot. I only go to paid models on OpenRouter when the Qwen models don't give me the results I want.

Anyone else running LLMs locally on Apple Silicon Macs?
 
I'm rocking a 48 GB M4 Max model for now. Playing around with LLMs for the first time (I know, right, where've I been the last few years?). I love it, can't get enough of it!

Might have to upgrade to 128 GB now, knowing that these and larger models exist. I'd like to try out the FP16 version of qwen2.5-coder-32b, a 66 GB download :eek:...

Thanks for pointing me to LM Studio. I like it a lot, especially since it retains the model in memory between requests. That seems to speed things up quite a bit compared with Ollama, which I've been using. I can't find the base models in LM Studio, though (for fill-in-the-middle code completion; it doesn't work right with the instruct models). I have them in Ollama, but that means I have to run both Ollama and LM Studio.

Are you using an IDE with it? So far I seem to have settled on VS Code with Continue extension, although Cursor looks pretty good too (just don't know how to configure it for a local model)
 
From what I've gathered, Q4 quantization is pretty efficient and uses far fewer resources than FP16. I'm still pretty new to this, so take my opinion with a grain of salt.
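
That said, the back-of-the-envelope numbers do support it: FP16 stores each weight in 2 bytes, while Q4 uses roughly half a byte per weight (plus some overhead), so for a 32B-parameter model:

```python
params = 32e9  # Qwen2.5 32B

fp16_gb = params * 2.0 / 1e9   # 2 bytes per weight
q4_gb   = params * 0.5 / 1e9   # ~4 bits per weight, ignoring quantization overhead

print(f"FP16: ~{fp16_gb:.0f} GB, Q4: ~{q4_gb:.0f} GB")
# FP16: ~64 GB, Q4: ~16 GB -- which matches the ~66 GB FP16 download mentioned above,
# and explains why the Q4 build fits comfortably in 64 GB of unified memory.
```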

I've made some progress since my last update. Right now I'm using LM Studio just for downloading models, because it simplifies that process. For the actual inference I've switched to using Llama.cpp directly, which lets me use prompt caching to speed up prompt evaluation. If you're not familiar with it, here's the gist:

- Most UIs and client applications interact with language models using an OpenAI-compatible API.
- This API has a parameter called `cache_prompt` that, when set to `true`, enables prompt caching.
- However, many clients don't set this parameter. The model itself is stateless, so the client resends the entire conversation history with every request, and without caching all of it has to be re-evaluated each time, which is why responses get slower as the conversation grows.
- With prompt caching, the previously evaluated prompts are kept in the session, so only the new prompt needs to be evaluated, which keeps the chat fast even as discussions get longer (see the request sketch below).
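
For the curious, here's roughly what a request with caching enabled looks like against a Llama.cpp server (llama-server). This is only a sketch; the port is llama-server's default and the conversation content is obviously made up:

```python
import requests

history = [
    {"role": "user", "content": "Summarise the plot of my story so far: ..."},
    {"role": "assistant", "content": "..."},
    {"role": "user", "content": "Now suggest three possible endings."},
]

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",  # llama-server's default port
    json={
        "messages": history,
        "cache_prompt": True,  # reuse the cached evaluation of the unchanged part of the conversation
    },
    timeout=600,
)
print(resp.json()["choices"][0]["message"]["content"])
```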

I wanted to use BoltAI for everything since it's my favorite app. It supports any backend with the OpenAI-compatible API, but it doesn’t send the `cache_prompt` parameter, so it has the same issue as other clients. To fix this, I created a proxy in Ruby that forwards requests to the backend and always sets `cache_prompt` to `true`. It works great!
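
The real project is written in Ruby (linked below), but the idea is simple enough to sketch here in a few lines of Python, just to show what the proxy does: accept the client's request, force `cache_prompt` to true, and pass everything through to the backend (streaming responses are left out for brevity):

```python
from flask import Flask, Response, request
import requests

BACKEND = "http://localhost:8080"  # the Llama.cpp server; adjust to your setup
app = Flask(__name__)

@app.route("/v1/chat/completions", methods=["POST"])
def chat_completions():
    body = request.get_json(force=True)
    body["cache_prompt"] = True  # the whole point: always enable prompt caching
    upstream = requests.post(f"{BACKEND}/v1/chat/completions", json=body, timeout=600)
    return Response(
        upstream.content,
        status=upstream.status_code,
        content_type=upstream.headers.get("Content-Type", "application/json"),
    )

if __name__ == "__main__":
    app.run(port=9090)  # point BoltAI (or any OpenAI-compatible client) at this port
```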

If you're interested, check out the project at https://github.com/vitobotta/openai-prompt-cache-proxy. There’s also a script that can automate setting up LaunchAgents to keep models in memory with Llama.cpp and to set up the proxies. It only takes a few seconds, and after that, you connect your client to the proxy instead of directly to Llama.cpp.

Now I have a setup that works well and feels like a viable solution for using models locally for actual work, not just tinkering. LM Studio is fantastic, but even its built-in chat had the same issue with missing prompt caching. Most GUIs and tools are built on top of Llama.cpp, so I hope they'll add prompt caching support soon.
 
Does that prompt caching work with multiple simultaneous users or will it mix my prompts with someone else's? (maybe 5-10 potential users)
 
Currently I'm happy with my MBP 14" M1 Pro 10c/16c/32GB/2TB, purchased in 2022 as my first Mac, which spends most of its time in clamshell mode. I'm considering a Mac Mini M4 Pro as a real desktop machine, with the MBP becoming purely the machine for on-the-go or around the flat.

With the MBP I also started to get into local LLMs last weekend, with the latest LM Studio plus the odd YouTube video on the topic. Pretty cool what's possible locally already. I've been using the MLX version of Llama 3.2 3B, and even with that I found that multiple chats (a mix of text and coding) can push me into orange memory pressure sooner than I thought, with only "normal" things like a few Safari tabs open alongside, and none of the IDE or Docker containers from my dev use case running at the time.

This gives me confidence that the 64GB of RAM I was already considering is a must-have, so the Mac Mini M4 Pro 14c/20c/64GB/2TB I had in mind still seems to be the right one. Not ordered yet, though.
 
If you really want to enjoy working with LLMs and have the budget, getting more than 64GB of memory would be ideal, which means waiting for the Studio version instead of settling for the Mini. Honestly, I believe that models with around 32 billion parameters are quite impressive already. With 64GB, I can run two Q4 versions along with a couple of smaller models simultaneously without running into any problems. So, if you plan to use local models regularly, 64GB should be your minimum requirement.
 
Definitely interested in working with LLMs. Could I just ask, would 128GB be a huge advantage over 64GB?
 
When it comes to LLMs and AI, having more memory is really beneficial. With extra memory, you can either run stronger models or multiple models simultaneously. I'm quite content with my Pro model that has 64GB of RAM, but when I upgrade next time, I'll definitely consider getting the Max version if my budget allows for it, to get as much memory as possible.
 
I hadn’t really heard much about LLMs until recently, when I watched some YouTube videos showing what the M4 MBP was capable of. Over the last week or so, I’ve been dabbling with Ollama and various LLMs.

It amuses me that the M4Pro doesn’t break a sweat on anything else I do, but running the LLMs has the CPU temperature up and the fans on.

Originally I believed I’d overspecced and bought a MacBook that was wayyy too much laptop for me. Now I’m glad I did. Although 24GB is not what I’d call comfortable for playing with LLMs, it does at least allow me to put a foot in the lake and get my ankles wet. If I ever wanted to do LLMs for serious purposes, I would definitely re-spec a machine with much, much, much more RAM (and better cooling, probably).

For now, the occasional temperature rise and fan noise at least allows me to brainstorm my stories with a local LLM without having to limit myself to the smallest models.
 
For the price paid for the higher spec MM, wouldn't you have been better off waiting for the M4 Max Studio?

It's a genuine question as I'm in the same boat. Was considering a high spec M4 Pro Mac Mini, but on the other hand for just a little extra and some patience, I could pick up a base M4 Max Studio...... I'm currently back and forth on what to do!! lol
 
Wow, this all sounds really great. You guys seem to be on the bleeding edge when it comes to AI.

Unfortunately I am more of a consumer type, and my know-how is limited when it comes to LLMs. But maybe you could point me in the right direction; here is what I would like to do:

Over the years I have collected many, many PDF files and e-books in various formats, in an attempt to find specific, valuable information. That only got me so far, since I cannot read every PDF I have collected. Somewhere in that data vault of mine is pure gold on certain research topics. My mission is to find this gold with AI.
Now, is there software out there that I can fill with my own data and run AI over to find specific details across thousands of PDF files? What software would you use, and have you come across something like that? I would really love to start feeding my own AI project on a disconnected, database-driven AI Mac mini.

#1 Do you have any idea where to start?

#2 Assuming I will not be able to do this with my ancient Mac Pro 5,1, what hardware will I need?

Thanks in advance,

Alex
 

On the fan noise: with my M4 Pro mini the fans do kick in, but only after a while of sustained LLM use, at least from what I have noticed.


As for whether I'd have been better off waiting for the M4 Max Studio: to be honest, I've been thinking about it quite a bit. If I had the funds, upgrading to the Studio would definitely be on my radar. But here's the thing: LLMs are getting more efficient at an impressive rate. Smaller models with fewer parameters can now handle tasks that required much larger models only a few months ago, so there's a good chance I'll be able to run better models on my current configuration as time goes on.

But then again, part of me can't help but ponder if maybe I should've waited for the Studio. I was feeling just a tad impatient at that time.


As for digging through your PDF collection: I believe what you're after is something called RAG (retrieval-augmented generation). To be honest I haven't dabbled in that area yet, but my understanding is that it lets a model draw on a large set of documents efficiently without needing to fine-tune it. I'm not completely certain that's 100% accurate, though.
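
From what I've read, the basic shape of it is: extract text from the PDFs, split it into chunks, compute embeddings for the chunks, then at question time retrieve the most similar chunks and hand them to the model together with the question. Very roughly, and purely as an illustration (the library and model choices here are just examples, not recommendations):

```python
# Rough RAG sketch: index PDF chunks with embeddings, then retrieve by similarity.
import numpy as np
from pypdf import PdfReader
from sentence_transformers import SentenceTransformer

def pdf_chunks(path, size=1000):
    text = " ".join(page.extract_text() or "" for page in PdfReader(path).pages)
    return [text[i:i + size] for i in range(0, len(text), size)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = []
for path in ["paper1.pdf", "paper2.pdf"]:   # your thousands of PDFs would go here
    chunks += pdf_chunks(path)
vectors = embedder.encode(chunks, normalize_embeddings=True)

question = "What does my collection say about topic X?"
q = embedder.encode([question], normalize_embeddings=True)[0]
top = np.argsort(vectors @ q)[-5:][::-1]    # indices of the five most similar chunks

context = "\n\n".join(chunks[i] for i in top)
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# 'prompt' would then go to a local model (Ollama, LM Studio, Llama.cpp, ...).
```

In practice you'd want a proper vector store and smarter chunking, and there are ready-made apps that wrap this whole flow, but that's the gist of what RAG does.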
 
The M4 Max Studio is what I'm waiting for. I bought a 16" MBP with the 20-core GPU M4 Pro, and in LM Studio it's almost exactly half the speed of my 60-core M2 Ultra (which is impressive) with Qwen 2.5 14B and Gemma 2 27B.

That is to say I expect the 40 core M4 Max to be on par with the M2 Ultra in LM Studio (I haven't personally tested it, but it's a safe assumption IMO)

However the fan noise gets ridiculous. The difference between the MBP and the Studio is night and day. I can constantly generate on the Studio and it stays silent + cool. My wallet is ready for the M4 Ultra, Timmy!
 
Ugh... you've really got me thinking more about the studio now :D I'll see how things go when it comes out. If my M4 Pro mini, which I have had for only a few weeks, still has good value by then, I might think about upgrading. Fingers crossed.
 
I’m currently using gemma2 which is a bit of a stretch for my 24GB, but it does work.

Also trying out AI image generation - now that really does get the fans running, and the battery draining pretty fast, but it does work. Amazing machine.
 

Since my last posts I've switched from Llama.cpp to Ollama, primarily because it's more straightforward and now includes prompt caching.

For my tasks, I'm currently using these models:

- Virtuoso Small: This is an enhanced 14b model developed from Supernova Medius (which itself is an improved version of Qwen2.5 14b) along with other models. It works great for general tasks such as improving, summarizing, or translating text.

- Qwen2.5 Coder 14b: This is used mainly for code refactoring.

- Qwen2.5 Coder 3b: Perfect for code autocompletion in VSCode with the Continue.dev extension.

These models meet my needs about 90% of the time. However, when I'm not satisfied with the results, I opt for the 32b versions of the Qwen models. If that doesn't cut it, I switch to hosted models via OpenRouter, such as Llama 3.3 70b or Claude Sonnet 3.5.

I have an M4 Pro mini equipped with 64 GB of RAM. The 32b versions of the Qwen models run fine at decent speeds (around 11 tokens/second), but I stick to the 14b versions by default because they're twice as fast.
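
One small tip for anyone else going the Ollama route: by default it unloads a model after a few minutes of idle time, but its HTTP API accepts a `keep_alive` value per request, so you can keep models resident the way LM Studio does. For example (port 11434 is Ollama's default):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5-coder:14b",
        "messages": [{"role": "user", "content": "Refactor this function: ..."}],
        "stream": False,
        "keep_alive": -1,   # keep the model loaded indefinitely instead of unloading after ~5 minutes
    },
    timeout=600,
)
print(resp.json()["message"]["content"])
```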

Which models do you use for AI generation?
 
I'm new to all of this at the moment, so I'm being led by Alex Ziskind on YouTube. I set up Stable Diffusion from one of his videos and have tried a selection of different models from the site he looks at in it. I don't understand all of it (he's a bit quick), but it's a starting point from which I'm eager to learn more.

Following his tutorial, I can now get Ollama to create an image based on a prompt - which I’m finding might be quite useful because I’m using LLM to brainstorm my story and it’ll be handy to have an image depicting the scene in question.

With LLM, I initially tried various models from 3.2B to 14B but then I took the plunge and loaded gemma2:27b. A 27B LLM is a bit weighty for my 24GB RAM and it can take a short while to respond to an in-depth query, but it does work. I think it produces better results when brainstorming my story. Not as good as ChatGPT, but definitely better than some of the smaller models. Do you think I should try Virtuoso Small?
 
So, a beginner question if you don't mind: I work with medical records, so I'm considering a local LLM for the privacy advantages. I'm told that if I set one up and do LoRA training on my 50,000 old one-to-two-page documents, I can get the LLM to rewrite clunky AI scribe notes into something much closer to my own note-writing style.

I'm trying to decide whether a new M4 mini or a used/refurbished M1 or M2 Mac Studio would be capable of that, presumably via LM Studio. How big a model, and how much RAM and how many GPU cores, are worthwhile for that amount and type of AI work?

We're buying new minis for the office anyway, so I would simply up the specs on one of them or replace one with an M1/M2 Studio. Suggestions?
 
Can I sneak into the thread? I’m genuinely interested in this topic.

See, now that I’m an owner of an M4 Mac, I’m tempted to test the waters of local LLMs. I bought the same machine as OP, the M4 Pro with 24GB of RAM. However, I’m switching to an M4 with 32GB of RAM, because I don’t really use the GPU that much, and upgrading storage and RAM on the M4 Pro is painful (for my pocket).

Which machine would you purchase, if local LLMs weren’t a high priority, but you wanted to have enough resources to move decent models? The M4 Pro with 24GB, or the M4 with 32GB?

From what I've read, the main resource LLMs demand is RAM, although the 24GB on the M4 Pro is much faster (273GB/s) than the 32GB on the M4 (120GB/s), while both are slower than the VRAM in some dedicated GPUs, so…

Ideally, I would install an LLM on this M4 Pro over the holidays and then, once I swap it for the M4 with 32GB, see how that exact model performs in an identical test. But honestly, I don't even know which model to pick or how to install it on my Mac, and right now I don't have much time to research it.

Which one would be more suited in the long term to play around with medium sized local LLMs? M4 Pro with 24GB or M4 with 32GB?
 
I have no knowledge of medical records, so I can't offer much help on this one. And I'm still very new when it comes to LLMs.

When running the LLM, I keep Activity Monitor open with the GPU History and CPU History panels showing.

When the LLM is constructing a reply, the GPU resources shoot up to max (they drop as soon as the LLM has done "thinking"). Also, I see the temperature of the efficiency cores rise and the fans may start to kick in depending on the complexity of the prompt.

The RAM size is, I believe, more down to how large a model you want to load up. My 24GB will just about handle the 27B gemma2 model (and some swapping goes on), but I wouldn't push it any further.

So, from my naive perspective, I would say you need as much RAM as you can get if you're going to load up a large LLM; and the number of GPU cores for speed of response.

Until I started looking at LLMs and image generation, my MBP (unbinned M4 Pro / 24GB / 1TB) barely even began to break a sweat with anything I did with it. If I were going into this specifically to focus on LLMs and/or image generation, though, I'd definitely be saving up for more RAM. If you're looking at the M4 vs M4 Pro, I do wonder if memory bandwidth might be a factor (I don't know how big a factor, but I am glad I have the Pro).

Personally, if I was looking for a machine to do some decent size LLM work on, I don't believe the relatively small difference from 24GB to 32GB would be worth it. I'd be going for something with much more RAM. With either 24GB or 32GB RAM, I suspect the more GPU cores on the M4 Pro and the faster memory bandwidth over the M4 might offer a better experience.
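
One rough rule of thumb I've picked up along the way (so treat it accordingly): token generation is mostly limited by memory bandwidth, because roughly the whole model has to be read from memory for every token generated. Dividing bandwidth by model size gives a ballpark ceiling, which lines up reasonably well with the speeds reported earlier in this thread:

```python
def rough_tokens_per_sec(bandwidth_gb_s, model_size_gb):
    # Ceiling estimate: each generated token reads roughly the whole model from memory.
    return bandwidth_gb_s / model_size_gb

print(rough_tokens_per_sec(273, 18))   # M4 Pro, 32B model at Q4 (~18 GB): ~15 tok/s ceiling
print(rough_tokens_per_sec(120, 18))   # plain M4, same model: ~6-7 tok/s ceiling
# Real-world numbers come in a bit lower (e.g. the ~11-12 tok/s reported earlier),
# but the ratio between chips tracks the bandwidth ratio fairly closely.
```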

But, as I've said, I'm incredibly new at this so don't take my advice as definitive. Check out some of Alex Ziskind's YouTube videos on the subject. He compares various machines in his LLM videos, and the results are interesting. My take on it is you're either going to be testing the waters on an affordable machine (like me), or you're going to "go large / spend big" to get the best results that you can.
 