You can use the llama.cpp server directly to serve local LLMs and use them in Claude Code or other CLI agents. I've collected full setup instructions for Gemma4 and other recent open-weight LLMs here, tested on my M1 Max 64 GB MacBook:
https://pchalasani.github.io/claude-code-tools/integrations/...
The 26BA4B is the most interesting to run on such hardware, and I get nearly double the token-gen speed (40 tok/s) compared to Qwen3.5 35BA3B. However, the tau2 bench results [1] for this Gemma4 variant lag far behind the Qwen variant (68% vs 81%), so I don't expect the former to do well on tool-heavy agentic tasks:
[1] https://news.ycombinator.com/item?id=47616761
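For anyone wiring this up themselves: llama-server exposes an OpenAI-compatible API, so any generic client can talk to it. A minimal Python sketch, assuming the default local endpoint (http://localhost:8080/v1) and a placeholder model name — adjust both for your setup:

```python
import json
import urllib.request

# Assumed local endpoint; llama.cpp's llama-server serves an
# OpenAI-compatible API on port 8080 by default.
BASE_URL = "http://localhost:8080/v1"

def build_chat_request(prompt, model="gemma4-26b", temperature=1.0):
    """Build an OpenAI-style chat-completion payload for llama-server."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt):
    """Send the prompt to the local server and return the reply text.
    Requires a running llama-server; e.g. chat("Say hello in one word.")"""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same client works unchanged whether the backend is llama.cpp, LM Studio, or vLLM, since all of them speak this API shape.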
Local models are finally starting to feel pleasant instead of just "possible." The headless LM Studio flow is especially nice because it makes local inference usable from real tools instead of as a demo.
Related note from someone building in this space: I've been working on cloclo (https://www.npmjs.com/package/cloclo), an open-source coding agent CLI, and this is exactly the direction I'm excited about. It natively supports LM Studio, Ollama, vLLM, Jan, and llama.cpp as providers alongside cloud models, so you can swap between local and hosted backends without changing how you work.
Feels like we're getting closer to a good default setup where local models are private/cheap enough to use daily, and cloud models are still there when you need the extra capability.
hackerman70000
The real story here isn't Gemma 4 specifically; it's that the harness and the model are now fully decoupled. Claude Code, OpenCode, Pi, Codex all work with any backend. The coding agent is becoming a commodity layer, and the competition is moving to model quality and cost. Good for users, bad for anyone whose moat was the harness.
trvz
ollama launch claude --model gemma4:26b
martinald
Just FYI, MoE doesn't really save (V)RAM. You still need all the weights loaded in memory; it just means fewer of them are consulted per forward pass. So it improves tok/s but not VRAM usage.
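Back-of-the-envelope arithmetic makes the parent's point concrete. A sketch, assuming a hypothetical 26B-total / 4B-active MoE quantized to ~4.5 bits per weight:

```python
def moe_footprint_gb(total_params_b, bits_per_weight=4.5):
    """Memory needed to load the model: every expert's weights count,
    so the footprint scales with *total* parameters."""
    return total_params_b * 1e9 * bits_per_weight / 8 / 1e9

def moe_compute_ratio(total_params_b, active_params_b):
    """Rough per-token speedup vs. a dense model of the same total size:
    FLOPs per token scale with *active* parameters, not total."""
    return total_params_b / active_params_b

mem = moe_footprint_gb(26)           # 14.625 GB: the full 26B must fit in RAM
speed = moe_compute_ratio(26, 4)     # 6.5x fewer FLOPs per token than dense 26B
```

So a 26BA4B model needs roughly the RAM of a dense 26B but runs closer to the speed of a dense 4B, which matches the tok/s numbers people report on Apple Silicon.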
vbtechguy
Here is how I set up Gemma 4 26B for local inference on macOS so that it can be used with Claude Code.
edinetdb
Claude Code has become my primary interface for iterating on data pipeline work — specifically, normalizing government regulatory filings (XBRL across three different accounting standards) and exposing them via REST and MCP.
The MCP piece is where the workflow gets interesting. Instead of building a client that calls endpoints, you describe tools declaratively and the model decides when to invoke them. For financial data this is surprisingly effective — a query like "compare this company's leverage trend to sector peers over 10 years" gets decomposed automatically into the right sequence of tool calls without you hardcoding that logic.
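For those who haven't used MCP: a tool declaration is essentially a name, a description, and a JSON-Schema for its inputs; the model plans calls against those descriptions and the host just dispatches them. A hand-rolled illustration — the tool name and fields here are hypothetical, not from the parent's actual pipeline:

```python
# A declarative tool description: nothing in client code decides *when*
# to call it — the model does, from the name/description/schema alone.
leverage_tool = {
    "name": "get_leverage_ratios",
    "description": "Return debt-to-equity ratios for a company over a year range.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string"},
            "start_year": {"type": "integer"},
            "end_year": {"type": "integer"},
        },
        "required": ["ticker", "start_year", "end_year"],
    },
}

def dispatch(tool_call, registry):
    """Host-side plumbing: route a model-issued tool call to its handler."""
    handler = registry[tool_call["name"]]
    return handler(**tool_call["arguments"])

# Registry with a stub handler standing in for the real data backend.
registry = {
    "get_leverage_ratios": lambda ticker, start_year, end_year: [
        {"year": y, "ratio": 1.0} for y in range(start_year, end_year + 1)
    ],
}
```

The "compare leverage to sector peers" query then becomes a sequence of such calls that the model composes on its own.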
One thing I haven't seen discussed much: tool latency sensitivity is much higher in conversational MCP use than in batch pipelines. A 2s tool response feels fine in a script but breaks conversational flow. We ended up caching frequently accessed tables in-memory (~26MB) to get sub-100ms responses. Have you noticed similar thresholds where latency starts affecting the quality of the model's reasoning chain?
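The in-memory caching described above can be as simple as memoizing the hot table loads. A minimal sketch — the loader name and the simulated backend delay are illustrative, not the parent's actual code:

```python
import functools
import time

@functools.lru_cache(maxsize=64)
def load_table(name):
    """Stand-in for an expensive fetch (DB or disk); memoized after
    the first call so repeat tool invocations are served from memory."""
    time.sleep(0.05)  # simulate a slow backend (~50 ms)
    return {"table": name, "rows": []}

t0 = time.perf_counter()
load_table("filings")            # cold: pays the backend cost
cold_ms = (time.perf_counter() - t0) * 1000

t0 = time.perf_counter()
load_table("filings")            # warm: served from the in-process cache
warm_ms = (time.perf_counter() - t0) * 1000
```

With real table sizes in the tens of MB, an in-process dict or lru_cache easily keeps the warm path well under 100 ms; the trade-off is staleness, so cache invalidation policy matters more than the caching itself.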
drob518
Seems like this might be a great way to do web software testing. We’ve had Selenium and Puppeteer for a long time but they are a bit brittle with respect to the web design. Change something about the design and there’s a high likelihood that a test will break. Seems like this might be able to be smarter about adapting to changes. That’s also a great use for a smaller model like this.
ttul
I could see a future in which the major AI labs run a local LLM to offload much of the computational effort currently undertaken in the cloud, leaving the heavy lifting to cloud-hosted models and the easier stuff for local inference.
jonplackett
So wait what is the interaction between Gemma and Claude?
asymmetric
Is a Framework desktop with >48GB of RAM a good machine to try this out?
Imanari
How well do the Gemma 4 models perform on agentic coding? What are your impressions?
janalsncm
Qwen3-coder has been better for coding in my experience and has similar sizes. Either way, after a bunch of frustration with the quality and price of CC lately I’m happy there are local options.
AbuAssar
MLX gives better performance than ollama on Apple Silicon.
Someone1234
Using Claude Code this way seems like a popular approach currently. I wonder how long until Anthropic releases an update to make it a little (or a lot) less turn-key? They've been very clear that they aren't exactly champions of this stuff being used outside of very specific ways.
jedisct1
Running Gemma 4 with llama.cpp and Swival:
$ llama-server \
    --reasoning auto \
    --fit on \
    -hf unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL \
    --temp 1.0 --top-p 0.95 --top-k 64
$ uvx swival --provider llamacpp
Done.
aetherspawn
Can you use the smaller Gemma 4B model as speculative decoding for the larger 31B model?
Why/why not?
tiku
I hate that my M5 with 24 GB has so much trouble with these models. I'm not getting any good speeds, even with simple models.
inzlab
Awesome. The lighter the hardware that can run big software, the greater the novelty.
NamlchakKhandro
I don't know why people bother with Claude Code.
It's so janky; there are far superior CLI coding harnesses out there.
smcleod
Did you try the MLX model instead? In general, MLX tends to provide much better performance than GGUF/llama.cpp on macOS.