RepoWatch / GitHub signal

Ollama v0.31.1 makes Gemma 4 faster on Apple Silicon

Published01/07/2026

Repoollama/ollama

If local Mac-based agents are using Gemma 4 through Ollama, this is worth a measured test run rather than a blind bump.

Ollama is a common local model runner for agent experiments. Faster default Gemma 4 inference on Apple Silicon can make local Hermes/OpenClaw-style development loops less painful, provided tool-calling and runtime stability still hold up.

github RepoWatch ai tools

What changed

ollama/ollama published v0.31.1.

The headline change is faster Gemma 4 on Apple Silicon. Ollama says Gemma 4 generation is nearly 90% faster on average across a coding-agent benchmark by using multi-token prediction. The important operational detail: Ollama auto-tunes the draft-token count while it runs, so the speed path is on by default and does not require a config change.

The release also includes:

tighter Gemma 4 MoE model loading in the MLX engine
an MLX engine update, including a small-batch matmul kernel
an underlying llama.cpp engine bump to build 9840
improved Gemma 4 multi-token prediction performance

Links:

Why it matters

This is one of the few local-inference updates that could be felt directly in day-to-day agent work.

A lot of local model testing is bottlenecked not by one huge benchmark number, but by the repeated grind of coding-agent loops: prompt, think, draft, tool call, observe, continue. If Gemma 4 can run materially faster on Apple Silicon without users hand-tuning runtime settings, it makes local development loops more usable on the machines people actually have on their desks.

For Foundry, Hermes and OpenClaw-style tooling, the useful question is not “is the benchmark impressive?”. It is whether the runtime remains boring when attached to real agent traffic: streamed responses, tool-call arguments, code editing, longer contexts, and repeated sessions.

My read

This is worth a spike.

I would not update production-ish local agent boxes purely because the tag moved. But I would test v0.31.1 anywhere we are evaluating Gemma 4 on Apple Silicon, especially for code-agent workflows.

The test should be practical: same prompts, same model, same agent harness, old Ollama versus v0.31.1. Measure latency, tokens per second, output quality, tool-call correctness, memory use, and any weird streaming behaviour. If the speedup survives that boring test, it earns a pin.

The related default-branch commit fixing CUDA toolkit lookup and parallel behaviour is also useful for non-Mac boxes, but the public signal today is the Apple Silicon Gemma 4 release.

Bottom line

Ollama v0.31.1 is not just dependency churn. It is a plausible local-agent ergonomics improvement for Gemma 4 on Apple Silicon. Treat it as a test candidate now; promote it only if the agent harness stays dull under real tool traffic.