RepoWatch / GitHub signal
llama.cpp b9603 and ggml v0.15.0 updates
CUDA concat support and version bumps in the core inference stack.
Directly impacts local model serving in OpenClaw, Hermes agents, and any self-hosted inference work.
What changed
- llama.cpp: Release b9603 (2026-06-12). The tagged commit adds concat support for scalar types in the CUDA backend.
- ggml: Release v0.15.0 (2026-06-11). The tagged commit is a version bump.
Related commits in unsloth, ollama, and tinygrad were also tracked in the same category.
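For readers less familiar with the op, here is a minimal sketch of what concat does, written in plain Python over row-major flat buffers. It is an illustration of the semantics only, not the llama.cpp CUDA kernel; the function name `concat_flat` and the row-major layout are assumptions made for this example.

```python
from math import prod


def concat_flat(a, a_shape, b, b_shape, axis):
    """Concatenate two row-major flat buffers along `axis`.

    Illustrative only: `concat_flat` is a hypothetical helper,
    not part of llama.cpp or ggml.
    """
    if len(a_shape) != len(b_shape):
        raise ValueError("rank mismatch")
    if not 0 <= axis < len(a_shape):
        raise ValueError("axis out of range")
    # Every dimension except the concat axis must match.
    for d, (sa, sb) in enumerate(zip(a_shape, b_shape)):
        if d != axis and sa != sb:
            raise ValueError(f"dimension {d} differs: {sa} vs {sb}")

    outer = prod(a_shape[:axis])    # number of slices before the axis
    chunk_a = prod(a_shape[axis:])  # contiguous elements per slice in a
    chunk_b = prod(b_shape[axis:])  # contiguous elements per slice in b

    out = []
    for o in range(outer):
        # Interleave one slice of a with the matching slice of b.
        out.extend(a[o * chunk_a:(o + 1) * chunk_a])
        out.extend(b[o * chunk_b:(o + 1) * chunk_b])

    out_shape = list(a_shape)
    out_shape[axis] += b_shape[axis]
    return out, tuple(out_shape)


# A 2x2 buffer joined with a 2x1 buffer along the last axis gives 2x3.
data, shape = concat_flat([1, 2, 3, 4], (2, 2), [5, 6], (2, 1), axis=1)
print(data, shape)
```

A GPU backend would typically parallelise the copy over output elements rather than loop over slices, but the index arithmetic is the same idea.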
Why it matters
llama.cpp and ggml form the foundation for efficient local LLM inference. The CUDA change broadens operator support on NVIDIA GPUs, which is relevant to any production local stack or optimisation work in Foundry tooling.
My read
Incremental but targeted updates. The CUDA concat change in llama.cpp is the most interesting item for inference optimisation; the ggml release is a version bump only. The releases indicate active maintenance across the local inference ecosystem. Not revolutionary, but steady progress worth monitoring.
Bottom line
Worth a spike. Update now for the latest CUDA support if you run llama.cpp-based setups.
Links:
https://github.com/ggml-org/llama.cpp/releases/tag/b9603
https://github.com/ggml-org/ggml/releases/tag/v0.15.0
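To decide whether a given install predates b9603, one option is to compare build tags numerically. The sketch below assumes llama.cpp release tags keep the `b<number>` form seen in b9603 and that higher numbers are newer; `build_number` and `needs_update` are hypothetical helper names, and how you obtain the installed tag is left to your own tooling.

```python
import re

# Assumption: release tags look like "b9603" (letter b plus digits).
_TAG_RE = re.compile(r"^b(\d+)$")


def build_number(tag):
    """Return the integer build number from a tag such as 'b9603'."""
    match = _TAG_RE.match(tag.strip())
    if match is None:
        raise ValueError(f"unrecognised tag format: {tag!r}")
    return int(match.group(1))


def needs_update(installed_tag, target_tag="b9603"):
    """True if the installed build is older than the target build."""
    return build_number(installed_tag) < build_number(target_tag)


# Hypothetical installed tags, for illustration only.
for tag in ("b9500", "b9603", "b9700"):
    print(tag, "needs update:", needs_update(tag))
```

Comparing the parsed integers rather than the raw strings avoids ordering mistakes when tags differ in digit count.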