ljosifov
For high (unified) RAM with middling-to-low TFLOPS and memory bandwidth (GB/s), MoEs are usually the most promising. My current top-1 by (IQ, tok/s, context depth) on an M2 Max with 96 GB is DeepSeek-V4-Flash REAP25 (<65 GB GGUF) + ds4-server + pi agent. Not better than a cloud API ofc, but useful enough to endure if I need to. E.g. on a 4h flight with no Internet, the battery held long enough (the local LLM draws 60 W). The REAP-supporting ds4 branch is here: https://github.com/ljubomirj/ds4/tree/reap-compact-support
DS4F only drops to an unusable <10 tok/s at 784K context (!!), which makes a big difference.
jumploops
I've been quite impressed with DeepSeek v4 Flash running via antirez's ds4[0].
It feels like a GPT-4 class model in terms of "stored knowledge" but is better at long-horizon tool calling than any of the GPT-4 class models.
Running on a 128GB MBP M4 Max, I'm getting ~24 t/s on generation and ~200 t/s on prefill. I was expecting it to feel slow, and it certainly does when e.g. generating code, but it's surprisingly useful as a "machine orchestrator" for simple tasks.
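To put the ~24 t/s generation and ~200 t/s prefill figures in perspective, here is a back-of-the-envelope sketch. It assumes prefill and generation run strictly one after the other at constant rates and ignores prompt caching; the function name and the token counts are made up for illustration.

```python
def response_seconds(prompt_tokens: int, output_tokens: int,
                     prefill_tps: float = 200.0, gen_tps: float = 24.0) -> float:
    """Estimate wall-clock seconds as prefill time plus generation time."""
    if prefill_tps <= 0 or gen_tps <= 0:
        raise ValueError("throughput values must be positive")
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


if __name__ == "__main__":
    # Hypothetical workloads: a short tool call vs. writing a larger chunk of code.
    for label, prompt, output in [("tool call", 2000, 50), ("code generation", 2000, 1200)]:
        print(f"{label}: ~{response_seconds(prompt, output):.0f} s")
```

Under these assumptions a short orchestration step stays in the range of seconds, while a long code answer takes about a minute, which matches the "slow for code, fine for orchestration" impression.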
For non-agentic use cases, it's a decent enough model to converse with, and has the benefit of being entirely self-contained/private.
[0] https://github.com/antirez/ds4
I cannot wait until a time in the future when we have local models that are Opus 4.6+ level and capable of running on inexpensive hardware like a 16GB Mac. Hopefully that's only a few years away.
dofm
Useful stuff in here that I wish I'd seen a few days ago :-)
I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.
Fiddling about with local models has done so much for my conceptual understanding of what is going on.
FWIW and YMMV but I also found the Gemma 4 MTP head was occasionally breaking markup in Opencode, causing the thinking to display untidily and ultimately in some cases missing the stop token. So I've stopped using MTP there for now.
Recent Qwen 3.6 models have developer role support, so they will occasionally surprise you with a structured multiple-choice questionnaire.
d4rkp4ttern
It’s relatively simple to use llama.cpp/server to spin up a local LLM to work with Claude Code or Codex-CLI. The required llama server settings are often scattered all over, so I maintain a set of instructions here for several popular open LLMs:
https://pchalasani.github.io/claude-code-tools/integrations/...
FYI you can open Claude Code in the terminal, point it at this article and just tell it to "do it", if you're feeling extra lazy.
alexwwang
I wonder whether these local models can really solve problems, especially for users who aren’t experts in a given programming language. Beyond inline autocompletion and implementing small units, I am not sure these models are capable of designing and composing tech specs that really work.
reddit_clone
>64 GB
That's the rub.
I have an M4 with 48 GB. I wonder if it is worth testing this out.
My past attempts (with Ollama and various LLMs) were too slow to use.
One way or another, local AI is the future. I actually find weaker models more interesting because they keep me sharp (at the cost of velocity, of course).
mark_l_watson
Nice writeup, thanks.
I run something very similar, except that instead of using pi directly as the agentic harness, I use little-coder, which wraps pi with reasonable defaults for running local models. Even though my local setup is a bit slow, it is a thrill to do real work completely locally.
hmontazeri
I use LM Studio with the local server it ships and connect it to opencode. Takes 2 min to set up.
anigbrowl
> This video is realtime. And shows the agent responding at a perfectly usable speed.
Alas, this video appears not to have been linked to the text that describes it. Perhaps I should ask an AI to generate an artistic rendering of the author's description.
reenorap
My biggest pet peeve with all these articles on local AI is that the only thing they talk about is tokens per second. No one mentions the quality of the answers. No one. I don't mind waiting a little longer if the quality is better. Quickly serving me slop doesn't make it more useful. Are people really only looking at tokens per second?
smetannik
I wonder why something like LM Studio didn't work for the author?
namnnumbr
oMLX (https://github.com/jundot/omlx) makes running the MLX inference server quite easy for those interested in UI-based hosting. oMLX also supports MTP or dflash drafting.
bicepjai
I assumed LM Studio was the obvious choice after Ollama. Is there a reason LM Studio is not used widely?
koliber
How much RAM did the local machine have?
cdolan
Is there a link to the video? It did not render when I went to the page. Curious about the real-time feel of this
everlier
You can also install Harbor and then it's:
harbor up omlx opencode
attogram
8B max on a standard 16 GB MacBook. Anything more and your Mac is toast.
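A rough way to sanity-check that rule of thumb is to estimate the size of the weights alone. The sketch below assumes a uniform number of bits per weight and ignores the KV cache, runtime overhead, and memory the OS needs, so real requirements are higher; the function name and the bit widths shown are illustrative.

```python
def weight_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    if params_billions < 0 or bits_per_weight <= 0:
        raise ValueError("parameter count must be >= 0 and bit width must be > 0")
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


if __name__ == "__main__":
    # An 8B-parameter model at a few nominal precisions (weights only).
    for bits in (16, 8, 4):
        print(f"8B @ {bits}-bit: ~{weight_gib(8, bits):.1f} GiB")
```

At 16 bits per weight an 8B model already needs nearly 15 GiB for weights, which leaves almost nothing on a 16 GB machine; at around 4 bits it drops below 4 GiB, leaving room for context and everything else that is running.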
rectang
Does anybody run a local agent on a Mac using an outboard GPU?
metadaemon
Has anyone compared a setup like this to just using LM Studio?
LoganDark
I poured a couple days into custom Burn inference for Qwen3-Coder-Next only to find it doesn't come with a speculative decoder, so on my M4 Max I can't push it much further than 120t/s. That's still kinda slow, though still faster than llama.cpp's 70.9t/s and MLX's 80.6t/s with the same model. Claude Fable 5 is recommending I use the Qwen3 MTP -- I worry that will compromise the quality somewhat, but might give it a try to see if I can get more usable speeds.
sleepybrett
Or you can just load up Ollama, have it load a local model and point Claude or opencode at it...
Is this article old? It's not. I'm not sure why he went through all the bother of llama.cpp.
k2enemy
Grammar note:
When used as a verb, it should be "set up," and when used as a noun, "setup."
> The benchmark prompt was:
> Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.
> Each benchmark generated about 128 tokens.
Generating 128 tokens is probably not enough for good benchmark results. MTP speedup depends on how often the predicted tokens are accepted. In my experience, the very early output has a higher acceptance rate, so short testing can give false positive speedups.
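To make the acceptance-rate point concrete, here is a rough sketch that is not taken from the article or from llama.cpp. It assumes every drafted token is accepted independently with the same probability `alpha` (a simplification, since real acceptance varies with position and content) and uses the usual geometric-series estimate of tokens produced per verification pass. The function name and the sample rates are illustrative, not measurements.

```python
def expected_tokens_per_pass(alpha: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model pass.

    Assumes each of the `draft_len` drafted tokens is accepted independently
    with probability `alpha`; every pass yields at least one token.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if alpha == 1.0:
        return float(draft_len + 1)
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^draft_len
    return (1.0 - alpha ** (draft_len + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    # Hypothetical acceptance rates: higher early in a response, lower later on.
    for label, alpha in [("early output", 0.8), ("long generation", 0.5)]:
        tokens = expected_tokens_per_pass(alpha, draft_len=3)
        print(f"{label}: {tokens:.2f} tokens per pass")
```

If acceptance really is higher in the first hundred or so tokens, a 128-token run samples mostly the favorable regime, so the measured speedup would overstate what a long generation sees.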
llama.cpp includes a tool specifically for benchmarking that will sweep the arguments for you so you don't have to restart the server and send it prompts:
https://github.com/ggml-org/llama.cpp/blob/master/tools/llam...
EDIT: Also the section about downloading the models should have mentioned that llama.cpp has a "-hf" argument that will download the models for you. I appreciate the author for sharing their experience, but for beginners this might not be the best guide to use.
I wrote a similar post some time ago; I just used ollama and opencode: https://blog.kulman.sk/running-local-llm-coding-server/
Why not just get Claude to set it up?
Not sure you really need huggingface-cli to download anything if you're just using llama.cpp. You can pass `-hf ...` and it will download the models for you. Set `LLAMA_CACHE` to change where the downloads go:
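The command from the original comment is not preserved here, so the following is only a sketch of what such an invocation might look like when driven from Python. It assumes the `llama-server` binary is on the PATH; the helper name is hypothetical, and `<user>/<model>` is a placeholder rather than a real repository.

```python
import os
import shlex


def llama_server_invocation(hf_repo: str, cache_dir: str) -> tuple[list[str], dict[str, str]]:
    """Build (argv, env) for launching llama-server with an -hf download."""
    env = dict(os.environ)
    env["LLAMA_CACHE"] = cache_dir  # directory where downloaded models are stored
    argv = ["llama-server", "-hf", hf_repo]
    return argv, env


if __name__ == "__main__":
    # "<user>/<model>" is a placeholder, not a real repository name.
    argv, env = llama_server_invocation("<user>/<model>", "/tmp/llama-cache")
    print(f"LLAMA_CACHE={shlex.quote(env['LLAMA_CACHE'])} {shlex.join(argv)}")
    # To actually launch it: subprocess.run(argv, env=env, check=True)
```

The sketch only builds and prints the command; nothing is downloaded or started until the `subprocess.run` call is added.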
I have used omlx.ai with great success, both to download multiple MLX models (including gemma and qwen) suited for my hardware and to automagically launch both open-source and closed-source (claude code, codex) harnesses using these models. All from a web or desktop UI.
You would not need to follow a blog post with omlx IMHO
Here's a visual post for using LM Studio and VS Code (and Pi): https://blog.alexewerlof.com/p/local-llms-for-agentic-coding
Grammar note:
When used as a verb, it should be "set up," and when used as a noun, "setup."
Other examples (verb, noun):
log in, login
back up, backup
shut down, shutdown
break down, breakdown
warm up, warmup