RepoWatch / GitHub signal
Ollama v0.31.1 makes Gemma 4 faster on Apple Silicon
If local Mac-based agents are using Gemma 4 through Ollama, this is worth a measured test run rather than a blind bump.
Ollama is a common local model runner for agent experiments. Faster default Gemma 4 inference on Apple Silicon can make local Hermes/OpenClaw-style development loops less painful, provided tool-calling and runtime stability still hold up.
What changed
ollama/ollama published v0.31.1.
The headline change is faster Gemma 4 on Apple Silicon. Ollama says Gemma 4 generation is nearly 90% faster on average across a coding-agent benchmark by using multi-token prediction. The important operational detail: Ollama auto-tunes the draft-token count while it runs, so the speed path is on by default and does not require a config change.
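To put the headline number in wall-clock terms, here is a small sketch. It assumes "nearly 90% faster" means generation throughput rises to about 1.9x the old rate; that reading is an assumption, not something the release note spells out here. The helper name `latency_ratio` is made up for illustration.

```python
def latency_ratio(percent_faster: float) -> float:
    """Return new_time / old_time for a fixed number of generated tokens.

    Assumption: "N% faster" means throughput is multiplied by (1 + N/100).
    """
    if percent_faster <= -100:
        raise ValueError("percent_faster must be greater than -100")
    return 1.0 / (1.0 + percent_faster / 100.0)


ratio = latency_ratio(90)
print(f"generation time: about {ratio:.0%} of the old time")
```

Under that reading, generation time for the same output drops to a little over half of what it was, not to a tenth. Time spent waiting on tools is not part of generation, so a whole agent loop would improve by less than the headline figure.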
The release also includes:
- tighter Gemma 4 MoE model loading in the MLX engine
- an MLX engine update, including a small-batch matmul kernel
- an underlying llama.cpp engine bump to build 9840
- improved Gemma 4 multi-token prediction performance
Links:
- https://github.com/ollama/ollama/releases/tag/v0.31.1
- https://github.com/ollama/ollama/commit/2ea95fb059278bcc6cb2016de1b5c5c1cc405644
Why it matters
This is one of the few local-inference updates that could be felt directly in day-to-day agent work.
A lot of local model testing is limited less by any single headline benchmark number than by the repeated grind of coding-agent loops: prompt, think, draft, tool call, observe, continue. If Gemma 4 can run materially faster on Apple Silicon without users hand-tuning runtime settings, it makes local development loops more usable on the machines people actually have on their desks.
For Foundry, Hermes and OpenClaw-style tooling, the useful question is not whether the benchmark is impressive. It is whether the runtime remains boring when attached to real agent traffic: streamed responses, tool-call arguments, code editing, longer contexts, and repeated sessions.
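One way to make "boring" checkable for tool-call arguments is a small validator run over every tool call the harness records. The sketch below assumes each call is shaped like the entries in the `tool_calls` list of an Ollama chat response (a `function` object with `name` and `arguments`). The helper `check_tool_call` and the tool name `read_file` are hypothetical.

```python
import json


def check_tool_call(call: dict, expected_name: str, required_args: set[str]) -> list[str]:
    """Return a list of problems found in one tool call; an empty list means it passed."""
    problems = []
    fn = call.get("function", {})
    if fn.get("name") != expected_name:
        problems.append(f"expected tool {expected_name!r}, got {fn.get('name')!r}")

    args = fn.get("arguments")
    if isinstance(args, str):  # tolerate clients that deliver arguments as a JSON string
        try:
            args = json.loads(args)
        except json.JSONDecodeError:
            return problems + ["arguments are not valid JSON"]
    if not isinstance(args, dict):
        return problems + ["arguments are not an object"]

    missing = required_args - set(args)
    if missing:
        problems.append(f"missing arguments: {sorted(missing)}")
    return problems


# Hypothetical recorded call from an agent session.
recorded = {"function": {"name": "read_file", "arguments": {"path": "src/main.py"}}}
print(check_tool_call(recorded, "read_file", {"path"}))
```

Counting non-empty results across the same session on the old and new Ollama versions gives a simple regression signal that is independent of speed.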
My read
This is worth a spike.
I would not update production-ish local agent boxes purely because the tag moved. But I would test v0.31.1 anywhere we are evaluating Gemma 4 on Apple Silicon, especially for code-agent workflows.
The test should be practical: same prompts, same model, same agent harness, old Ollama versus v0.31.1. Measure latency, tokens per second, output quality, tool-call correctness, memory use, and any weird streaming behaviour. If the speedup survives that boring test, it earns a pin.
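A minimal sketch of the speed half of that comparison, assuming a local Ollama server on its default port and using the timing counters Ollama reports in a non-streaming `/api/generate` reply (`eval_count`, `eval_duration` and `total_duration`, with durations in nanoseconds). The function names and the idea of running the suite once per installed version are illustrative assumptions. Output quality, memory use and streaming behaviour still need their own checks.

```python
import json
import statistics
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generate request and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tokens_per_second(reply: dict) -> float:
    """Generation speed from the reply's counters (eval_duration is in nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)


def summarize(replies: list[dict]) -> dict:
    """Median generation rate and median end-to-end time over a set of replies."""
    rates = [tokens_per_second(r) for r in replies]
    totals = [r["total_duration"] / 1e9 for r in replies]
    return {
        "median_tok_s": statistics.median(rates),
        "median_total_s": statistics.median(totals),
    }


def run_suite(model: str, prompts: list[str]) -> dict:
    """Run every prompt once against whichever Ollama version is currently serving."""
    return summarize([generate(model, p) for p in prompts])


def speedup(old: dict, new: dict) -> float:
    """Ratio of new to old median tokens per second; above 1.0 means faster."""
    return new["median_tok_s"] / old["median_tok_s"]
```

In use, `run_suite` would be called with the same model tag and prompt list under the old Ollama build, the result saved, and then called again after upgrading to v0.31.1. `speedup(old, new)` then gives the measured ratio to set against the advertised one. Medians are used so that a single slow run, such as the first request that loads the model, does not dominate.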
The related default-branch commit fixing CUDA toolkit lookup and parallel behaviour is also useful for non-Mac boxes, but the public signal today is the Apple Silicon Gemma 4 release.
Bottom line
Ollama v0.31.1 is not just dependency churn. It is a plausible local-agent ergonomics improvement for Gemma 4 on Apple Silicon. Treat it as a test candidate now; promote it only if the agent harness stays dull under real tool traffic.