Fun thread. 🙂
I've been running Qwen-Coder-Next 80B on my sort-of aging M1 Max MBP 64GB. A whole lot of playing around and a few bake-offs, with the intent of using via CLI and for vscode integration via Continue. I'm not 100% dialed in yet, but am running 64k context, and am pretty much at the bleeding edge - as someone else mentioned, I certainly had hit memory pressure red to the point of shutdown during some fine-tuning.
I'm running via locally configured and compiled llama.cpp as a launch service, serving local and for a few Linux dev VMs. Exact model: unsloth_Qwen3-Coder-Next-GGUF_Qwen3-Coder-Next-Q4_K_M.gguf
llama.cpp compiled with Metal support, etc.
Started down this path after being burned badly with Bolt (*super short* context it seems), and have worked in AI/ML for a handful of years (although not in LLM/generative space), so wanted to see where I could get to. This led to a long but useful rabbit hole of building my own agentic AI guardrail system, and reduced credit/token use by over 90% (no, it's not 'magic prompts' 😉 ), but also went looking at how far I could go locally.
Some interesting bits - I did compare vs Devstral/Mistral variants and a few others. QwenCoder-Next is an MOE model with selective hyperparam activation. The difference in response/tokens per second as well as time to response was a HUGE difference with Qwen3-Coder-Next coming out on top by quite a bit. The others simply weren't usable without screaming. Yes, I could have dropped down on the params and looked for a 30B (I only use 1.5B for code completion), but this is where I landed <right now> after a good amount of testing including code tests and reviews.
I probably could reduce my context size a bit, and increase max_tokens and prediction sizes.
In general, I use 'web ai' for planning discussions, finalize into a 'spec' I dump in the repo, then let either Codex or local model implement, or I implement part, then have it refer to the spec to complete and test. Sometimes I'll round robin and have web AI give me the local agent prompt and feedback ack/forth while jumping in if/when I see things going funny/wrong, etc.
Continue was kind of a huge pain to get sorted, so was vscode honestly. I'm a former sr sw engineer/architect, but moved into leadership roles some time back, so part is fun, but part is also me being a bit rusty, and I never preferred MS tools for well - anything at all, as a long-time Unix/Linux engineer. Did a few frustrating rounds of trying o get proper permissions enabled in Continue, while vscode config is literally *polluted* IMO with copilot settings, which was like a frigging virus - every time I thought I had banished it forever from my vscode, it found a way to pop back up like a virus. It didn't help with so MANY settings in vscode that sure sounded generic but were literally only for copilot, but got there, and will never touch/enable copilot in vscode again/it's banished forever for me. 😀
I also in general, refuse to shift to pay-per-token/API models at this point, as it's quite easy to see things jump pretty significantly. Of course, inevitable as someone said, pricing WILL go up, but if you're using paid models/plans, you do NOT need flagship Opus or GPT for 90% of most tasks. A buddy was burning around $1k/month, and a prior very senior sw engineering team I had managed was routinely running out of tokens. While some of what they're doing is pretty complex work, 90% of it could be done on Haiku, 5.4-mini or similar, and the price multipliers are HUGE on the flagship models. Right now, I use a paid GPT account for web AI, I keep a small paid plan on Bolt (always use your own project github, private is fine) for claude/haiku/sonnet/opus), and swap between codex (mostly 5.4-mini medium) and Qwen for coding tasks.
I know we're mostly talking about local, but ask your AI of choice or see if you can actually find a rate page for your brand model of choice - the difference in flagship vs 'pretty damned good' can be pretty crazy, even approaching 10x or more. Until OpenAI hosed me a bit (they seem to have dropped non-Spark 5.3-codex from the list in Codex app), I'd do 5.4-mini or Qwen, only occasionally bumping up to 5.3-codex. 5.4-mini is around 15% of GPT5, and still like 25% of 5.4 non-mini, while Codex is a bit less that 1/2 of the flagship, and slightly less than 5.4 non-mini. (not spark - they don't even tell you what it costs, which is hokey). So yeah, if you're going to use paid, consider which specific model to use when and your bank account and sanity will thank you.
Qwen3-coder-next generally performed at least as well as 5.4-mini and approaching 5.3-codex, occasionally but rarely slightly better, most of the times not-quite-as good. Single file/simple edits/patterns zero issues, but when you get into more complex tasks as in repo-wide refactors, this is where you want a higher-powered model or a detailed execution plan, possibly worked out with a web ai model with stronger planning/reasoning vs coding-focused agentic models.
So yeah, I now use Qwen and Codex almost interchangeably in practice, was doing Qwen for 90% of the work, but as I put my agent framework/guardrails in place and watched my credit usage plummet, it's a flip of the coin as to which I use. I probably need one more round of tuning on Qwen, but it's not locking up the system any more (yay 😉 ), although Codex remains faster vs the M1 Max.
Next step most likely is an M5 Studio 256GB if/when they announce it. I'll offload model use there, and keep the M1 Max MBP for a while longer as my travel system, and see what the 'mbp ultra' actually turns out to be. The M1 Max MBP is still an overall competent system - right up until I wanted to push the bleeding edge on RAM use with it. 😀