RepoWatch / GitHub signal

llama.cpp b9603 and ggml v0.15.0 updates

Published 12/06/2026

Repo ggml-org/llama.cpp

CUDA concat support and a version bump in the core inference stack.

Directly impacts local model serving in OpenClaw, Hermes agents, and any self-hosted inference work.

github RepoWatch ai local-inference llama-cpp ggml

What changed

llama.cpp: Release b9603 (2026-06-12). The commit adds concat support for scalar types in the CUDA backend.
ggml: Release v0.15.0 (2026-06-11). Version bump commit.
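For readers less familiar with the operation, here is a minimal sketch of what a concat along one dimension computes. It is plain Python over row-major flat buffers, not the CUDA kernel from the release and not ggml's memory layout. The function name concat_flat and its parameters are hypothetical, chosen only for illustration. Each output element is read from the first input while its index along the concat dimension is inside the first input's extent, and from the second input otherwise, which is the per-element decision a GPU thread would typically make.

```python
def concat_flat(a, a_shape, b, b_shape, dim):
    """Concatenate two row-major flat buffers along `dim` (illustrative only)."""
    rank = len(a_shape)
    if len(b_shape) != rank:
        raise ValueError("inputs must have the same rank")
    for d in range(rank):
        if d != dim and a_shape[d] != b_shape[d]:
            raise ValueError("shapes must match outside the concat dimension")

    out_shape = list(a_shape)
    out_shape[dim] = a_shape[dim] + b_shape[dim]

    def strides(shape):
        # Row-major strides: the last dimension is contiguous.
        s = [1] * len(shape)
        for d in range(len(shape) - 2, -1, -1):
            s[d] = s[d + 1] * shape[d + 1]
        return s

    sa, sb, so = strides(a_shape), strides(b_shape), strides(out_shape)

    total = 1
    for n in out_shape:
        total *= n

    out = [None] * total
    for i in range(total):
        # Unravel the flat output index into per-dimension indices.
        idx, rem = [], i
        for d in range(rank):
            idx.append(rem // so[d])
            rem %= so[d]
        if idx[dim] < a_shape[dim]:
            # Element comes from the first input.
            out[i] = a[sum(k * s for k, s in zip(idx, sa))]
        else:
            # Element comes from the second input, shifted back along `dim`.
            idx[dim] -= a_shape[dim]
            out[i] = b[sum(k * s for k, s in zip(idx, sb))]
    return out, out_shape


# A 2x2 buffer joined with a 2x1 buffer along the last dimension.
values, shape = concat_flat([1, 2, 3, 4], [2, 2], [5, 6], [2, 1], dim=1)
print(values, shape)  # [1, 2, 5, 3, 4, 6] [2, 3]
```

The loop over output elements is independent per element, which is why this kind of operation maps naturally onto a GPU backend.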

Related commits in unsloth, ollama, and tinygrad were also tracked in the same category.

Why it matters

llama.cpp and ggml form the foundation for efficient local LLM inference. The CUDA change extends concat support on NVIDIA GPUs, which is relevant to production local stacks and to optimisation work in Foundry tooling.

My read

Incremental but targeted updates. The CUDA concat change in llama.cpp b9603 is the most interesting for inference optimisation; the ggml v0.15.0 release is a version bump. Together the releases indicate active maintenance across the local inference ecosystem. Not revolutionary, but steady progress worth monitoring.

Bottom line

Worth a spike. Update now for the latest CUDA support if you run llama.cpp-based setups. Links: https://github.com/ggml-org/llama.cpp/releases/tag/b9603 and https://github.com/ggml-org/ggml/releases/tag/v0.15.0