LocalLLaMA debates best CPU-only SLMs
The thread’s consensus is that there’s no single CPU-only champion, but Liquid AI’s LFM2.5-1.2B-Instruct is the strongest default for genuinely usable local inference. Heavier options like Gemma 4 E2B/E4B, Qwen MoE variants, and gpt-oss-20b can work, but only when RAM, bandwidth, and decoding tricks line up.
The real winner here is not a model family but a deployment stack: CPU-only AI is now good enough for practical work if you optimize the runtime, quantization, and memory path. The thread makes that explicit by treating throughput and hardware fit as the deciding factors, not just benchmark scores.
- LFM2.5-1.2B-Instruct gets the strongest praise for being both fast and actually useful on CPU-only setups, especially for tagging and summarization workloads
- Gemma 4 E2B/E4B and gpt-oss-20b are the “bigger but still local” options, but commenters keep stressing that they get slow fast without enough RAM and bandwidth
- Qwen MoE variants show why sparse models matter on CPU: a small active parameter count can make a much larger total model surprisingly tractable
- The stack matters as much as the model: people are using llama.cpp, GGUF, custom kernels, NUMA-aware engines, Ollama, speculative decoding, and even app-specific acceleration like Google AI Edge Gallery
- The subtext is clear: CPU-only LLMs are no longer a novelty, but if you want responsive chat instead of a science project, you still need to bias hard toward smaller, optimized models
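The RAM, bandwidth, and sparse-model points above can be made concrete with a common rule of thumb, which is not something the thread itself spells out: during decoding, each generated token has to stream the active weights from memory roughly once, so memory bandwidth divided by the size of the active weights gives a rough ceiling on tokens per second. The sketch below is a back-of-envelope estimate under loud assumptions. The 50 GB/s bandwidth, the 4.5 bits per weight, and the MoE parameter counts are illustrative placeholders, not measurements or figures from the thread, and the function names are hypothetical. It ignores the KV cache, compute limits, and prompt processing, so real throughput will land below these numbers.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def max_decode_tokens_per_sec(
    active_params_billion: float,
    bits_per_weight: float,
    bandwidth_gb_s: float,
) -> float:
    """Rough ceiling on decode speed for a bandwidth-bound CPU.

    Assumes every generated token reads the active weights from RAM once.
    """
    return bandwidth_gb_s / weights_gb(active_params_billion, bits_per_weight)


if __name__ == "__main__":
    # Assumed values for illustration only; substitute your own hardware figures.
    BANDWIDTH_GB_S = 50.0   # assumed sustained memory bandwidth
    BITS_PER_WEIGHT = 4.5   # assumed average for a 4-bit-class quantization

    # (total params, active params) in billions. The dense entry takes its size
    # from the "1.2B" in the model name; the MoE entry is a hypothetical example.
    models = {
        "dense 1.2B": (1.2, 1.2),
        "hypothetical MoE, 20B total / 3B active": (20.0, 3.0),
    }

    for name, (total_b, active_b) in models.items():
        ram = weights_gb(total_b, BITS_PER_WEIGHT)
        ceiling = max_decode_tokens_per_sec(active_b, BITS_PER_WEIGHT, BANDWIDTH_GB_S)
        print(f"{name}: ~{ram:.1f} GB of weights, at most ~{ceiling:.0f} tokens/s")
```

The two inputs pull in different directions: total parameters decide whether the model fits in RAM at all, while active parameters set the speed ceiling. That is why a sparse model can need far more memory than a small dense one yet still decode at a usable rate.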
DISCOVERED
2026-05-23
PUBLISHED
2026-05-23
AUTHOR
last_llm_standing