RepoWatch / GitHub signal
llama.cpp b9859 adds precompiled OpenCL kernels for Adreno
This is not a blanket upgrade signal, but it is worth testing if local agent inference is moving onto Snapdragon/Adreno hardware.
llama.cpp sits underneath a lot of local model tooling. Better OpenCL support on Adreno matters for edge boxes, portable Windows-on-Arm machines, and any agent stack trying to escape the cloud/GPU rental treadmill.
What changed
ggml-org/llama.cpp published build b9859, pointing at commit 4fc4ec5.
The useful bit is the OpenCL change: llama.cpp can now load precompiled binary kernels from a Qualcomm Adreno kernel library when built with:
-DGGML_OPENCL_USE_ADRENO_BIN_KERNELS=ON
The docs say the prebuilt library currently targets Snapdragon X2 Adreno GPUs and ships kernels for MUL_MAT_ID (the indexed matrix multiply that mixture-of-experts models rely on), with Q4_0, Q4_1, Q4_K and MXFP4 among the covered formats.
The same commit also expands the documented OpenCL/Adreno support matrix, adding newer Adreno parts and additional common quantisation formats.
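As a rough sketch of how that flag might be wired into a test build, the snippet below assembles two CMake configure commands, one with the binary kernels on and one with them off. The flag name comes from the release. `-DGGML_OPENCL=ON` is assumed here to be the switch that enables the OpenCL backend, and the build directory names are invented for illustration. It only prints the commands and does not run them.

```python
import shlex

# Flag named in the b9859 notes. Everything else in this file is an assumption.
ADRENO_BIN_FLAG = "GGML_OPENCL_USE_ADRENO_BIN_KERNELS"


def cmake_configure_cmd(build_dir: str, adreno_bin_kernels: bool) -> list[str]:
    """Return a cmake configure command for one arm of the comparison."""
    state = "ON" if adreno_bin_kernels else "OFF"
    return [
        "cmake", "-S", ".", "-B", build_dir,
        "-DGGML_OPENCL=ON",  # assumed: enables the OpenCL backend
        f"-D{ADRENO_BIN_FLAG}={state}",
    ]


def comparison_builds() -> dict[str, list[str]]:
    """Two configurations that differ only in the Adreno binary kernel flag."""
    return {
        # Hypothetical build directory names.
        "baseline": cmake_configure_cmd("build-opencl", False),
        "adreno-bin": cmake_configure_cmd("build-opencl-adreno-bin", True),
    }


if __name__ == "__main__":
    for name, cmd in comparison_builds().items():
        print(f"{name}: {shlex.join(cmd)}")
```

Keeping the two configurations in separate build directories means both binaries can sit side by side on the test machine, so the only variable between runs is the flag itself.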
Links:
- https://github.com/ggml-org/llama.cpp/releases/tag/b9859
- https://github.com/ggml-org/llama.cpp/commit/4fc4ec5541b243957ae5099edb67372f8f3b550e
Why it matters
This is infrastructure work, not a shiny model drop.
For local agents, the interesting part is boring reliability: fewer runtime kernel-compilation surprises, more predictable OpenCL behaviour, and a clearer path for Snapdragon/Adreno machines to run quantised models without pretending every useful local box has CUDA.
That matters for Foundry-style operator tooling because local inference is only useful if it can be deployed repeatably. A fast demo on one developer machine is not a stack. A documented build flag, known binary kernels, and an explicit support matrix are closer to something you can actually test, pin, and hand to a client machine without roulette.
It also lands alongside a broader local-stack tidy-up: Unsloth fixed gpt-oss offloaded embedding/generation edge cases, bitsandbytes improved its MPS backend fallbacks, and Axolotl fixed LoRA merge handling for quantised bases. Different projects, same theme: fewer weird failures at the edge of quantised/local workflows.
My read
This is worth a spike, not an automatic update.
If we are testing Snapdragon X Elite/X2-style hardware for local agent workloads, b9859 deserves a small benchmark pass against the previous pinned llama.cpp build. The test should be practical: same GGUFs, same prompts, same agent harness, OpenCL enabled, with and without the Adreno binary kernel option.
Measure startup behaviour, first-token latency, tokens/sec, memory use, thermal throttling, and whether tool-heavy agent loops remain stable after repeated runs. If it only helps a narrow Snapdragon X2 path today, that is still useful intelligence. If it is brittle, park it.
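Two of those numbers, first-token latency and tokens/sec, can be derived from timestamps the harness already sees. The sketch below assumes the harness records when each request was sent and when each generated token arrived, using a monotonic clock. The `Run` type and function names are hypothetical and not part of llama.cpp. Memory and thermal data would need platform tooling and are left out.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    """Timestamps in seconds for one prompt, as captured by the harness."""
    request_sent: float
    token_times: list[float]  # arrival time of each generated token


def first_token_latency(run: Run) -> float:
    """Seconds from sending the request to receiving the first token."""
    if not run.token_times:
        raise ValueError("run produced no tokens")
    return run.token_times[0] - run.request_sent


def decode_tokens_per_sec(run: Run) -> float:
    """Throughput after the first token, so prompt processing is excluded."""
    if not run.token_times:
        raise ValueError("run produced no tokens")
    span = run.token_times[-1] - run.token_times[0]
    generated = len(run.token_times) - 1
    return generated / span if span > 0 else 0.0


def summarise(runs: list[Run]) -> dict[str, float]:
    """Median of each metric across repeated runs of the same prompt set."""
    return {
        "median_first_token_s": median(first_token_latency(r) for r in runs),
        "median_tokens_per_s": median(decode_tokens_per_sec(r) for r in runs),
    }
```

Calling `summarise` once per build, on the same prompts, gives two small dictionaries to compare. Medians are used so that a single throttled or cold-start run does not dominate the result; the raw runs are still worth keeping, since the outliers are exactly the startup and thermal behaviour under test.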
Do not rush this into any existing stable local stack unless that stack is specifically using the affected OpenCL/Adreno path.
Bottom line
llama.cpp b9859 is a small but material local-inference update for Qualcomm/Adreno experiments. Treat it as a candidate for edge-agent hardware testing, not general dependency churn.