Developer builds fast local Pi agent on macOS
This tutorial details building a fast, offline coding agent stack on Apple Silicon using llama.cpp, Gemma 4, and MTP speculative decoding. Connecting these components to the Pi open-source terminal agent achieves up to 72 tokens per second with multimodal support.
MTP speculative decoding and the macOS-specific optimizations in llama.cpp proved highly effective for local LLM performance, beating even MLX on the author's machine.
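To make the draft-and-verify idea behind speculative decoding concrete, here is a minimal toy sketch in Python. It is not llama.cpp's implementation and does not model MTP heads specifically. The functions `toy_target`, `toy_draft`, `greedy_generate`, and `speculative_generate` are hypothetical names invented for illustration, and the "models" are simple integer rules rather than neural networks.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical cheap draft: agrees with the target except after token 4."""
    return 9 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


def greedy_generate(target: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain greedy decoding: one target call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target(out))
    return out


def speculative_generate(
    target: NextToken, draft: NextToken, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns the tokens and the number of verify rounds."""
    out = list(prompt)
    verify_rounds = 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))

        # 2. The target checks the proposal. A real engine scores all k positions
        #    in one batched forward pass; this toy calls the target per position
        #    and counts the whole check as a single round.
        verify_rounds += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # keep the target's token, drop the rest
                break
        else:
            # Every draft token matched, so the same round yields one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[: len(prompt) + n_new], verify_rounds


if __name__ == "__main__":
    tokens, rounds = speculative_generate(toy_target, toy_draft, [0], n_new=12, k=3)
    print(tokens, rounds)
    print(tokens == greedy_generate(toy_target, [0], 12))
```

Under greedy acceptance the output matches what the target alone would produce. Any speedup comes from needing fewer target rounds than generated tokens, which depends on how often the draft agrees with the target.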
- llama.cpp with Metal acceleration unexpectedly outperformed MLX-LM for this specific Gemma 4 26B setup on an M1 Max.
- MTP draft models delivered a 24% boost in text generation speed without hurting prompt processing time.
- Adding a multimodal projector enables image input for Pi without slowing text generation.
- Gemma 4 is the primary example, but the author notes that Qwen3.6 35B is a much stronger coding model, though it generates more slowly (55 tokens/second versus 72 tokens/second).
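The throughput figures above can be related with some quick arithmetic. The sketch below assumes, without confirmation from the source, that the 72 tokens/second figure is the MTP-enabled Gemma 4 speed and that the 24% boost applies to it. The helper names and the 1,000-token reply length are hypothetical.

```python
def baseline_from_boost(boosted_tps: float, boost_fraction: float) -> float:
    """Speed before a relative boost, given the boosted speed."""
    return boosted_tps / (1.0 + boost_fraction)


def relative_slowdown(fast_tps: float, slow_tps: float) -> float:
    """Fraction by which the slower model trails the faster one."""
    return (fast_tps - slow_tps) / fast_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Wall-clock generation time for a reply of the given length."""
    return tokens / tps


if __name__ == "__main__":
    gemma_tps = 72.0  # reported Gemma 4 speed (assumed to include MTP)
    qwen_tps = 55.0   # reported Qwen3.6 35B speed
    mtp_boost = 0.24  # reported MTP gain

    print(f"implied speed without MTP: {baseline_from_boost(gemma_tps, mtp_boost):.1f} tok/s")
    print(f"Qwen3.6 relative slowdown: {relative_slowdown(gemma_tps, qwen_tps):.1%}")
    for name, tps in (("Gemma 4", gemma_tps), ("Qwen3.6", qwen_tps)):
        print(f"{name}: {seconds_for(1000, tps):.1f} s per 1,000 tokens")
```

Under these assumptions the implied pre-MTP speed is roughly 58 tokens/second, and Qwen3.6 generates about a quarter more slowly than Gemma 4.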
DISCOVERED
2026-06-12
PUBLISHED
2026-06-12
AUTHOR
kkm