Needle 26M model brings fast tool use on-device
Needle is an ultra-compact 26M-parameter model distilled from Gemini for high-speed function calling on mobile and wearable devices. By replacing Feed-Forward Networks with Simple Attention Networks, it achieves high on-device throughput and outperforms larger models on specialized retrieval-and-assembly tasks.
Needle proves that massive "reasoning" models are overkill for tool orchestration, shifting the bottleneck from model size to architectural efficiency for on-device agents.
- Replaces FFNs with Simple Attention Networks, treating tool use as retrieval rather than memorization
- Delivers 6,000 tokens/sec prefill and 1,200 tokens/sec decode on standard consumer hardware
- Outperforms FunctionGemma-270M and Qwen-0.6B on function-calling accuracy at a fraction of the size
- Integrates with the Cactus engine to enable sub-150ms agentic latency for privacy-first, on-device apps
- Open-sourced under MIT license with full weights and a local fine-tuning playground available on GitHub
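The announcement does not detail how a Simple Attention Network is constructed, only that it stands in for the position-wise FFN. As a minimal sketch under that assumption, the block below (all names hypothetical: `needle_block`, `attention`) stacks a second attention sublayer where the FFN would normally sit, keeping the usual residual connections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over the sequence.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

def needle_block(x, attn_w, san_w):
    # Hypothetical block layout: self-attention sublayer, then a
    # "Simple Attention Network" sublayer in place of the usual
    # position-wise FFN. Both sublayers use residual connections.
    x = x + attention(x, *attn_w)
    x = x + attention(x, *san_w)  # SAN instead of FFN
    return x

rng = np.random.default_rng(0)
d = 8
x = rng.standard_normal((4, d))            # 4 tokens, width 8
attn_w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
san_w = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
out = needle_block(x, attn_w, san_w)
print(out.shape)  # (4, 8)
```

Dropping the FFN's wide hidden projection is plausibly where most of the parameter savings come from, since FFNs typically hold the majority of a transformer's weights.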
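The "tool use as retrieval rather than memorization" framing can be illustrated with a toy tool selector: rather than generating a call from memorized patterns, the system scores a query against tool descriptions and retrieves the best match. Everything here is a hypothetical sketch (the `TOOLS` registry, `retrieve_tool`, and the bag-of-words embedding are illustrative stand-ins, not Needle's actual mechanism):

```python
import numpy as np

# Hypothetical tool registry; a real system would embed tool schemas
# with learned representations rather than bag-of-words counts.
TOOLS = {
    "set_alarm": "set an alarm or timer at a given time",
    "send_message": "send a text message to a contact",
    "get_weather": "look up the weather forecast for a city",
}

def embed(text, vocab):
    # Toy bag-of-words embedding, L2-normalized for cosine scoring.
    vec = np.zeros(len(vocab))
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def retrieve_tool(query):
    # Retrieve the tool whose description best matches the query.
    vocab = {w: i for i, w in enumerate(
        sorted({w for d in TOOLS.values() for w in d.split()}))}
    qv = embed(query, vocab)
    scores = {name: float(qv @ embed(desc, vocab))
              for name, desc in TOOLS.items()}
    return max(scores, key=scores.get)

print(retrieve_tool("what's the weather forecast in Lagos"))  # get_weather
```

Retrieval of this kind scales with the tool registry rather than with model capacity, which is consistent with the claim that a 26M model can match much larger ones on this task.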
DISCOVERED 2026-05-12
PUBLISHED 2026-05-12
AUTHOR HenryNdubuaku