OPEN_SOURCE
REDDIT // 3h ago // BENCHMARK RESULT
PrismML Ternary Bonsai shows split speeds
Reddit benchmarks paint a mixed picture for PrismML's 1.58-bit Ternary Bonsai models: Mac MLX performance looks strong, while a Windows Ryzen 5700G CPU-only run on the llama.cpp fork is much slower, with a painfully long time to first token (TTFT). The official repo notes that Q2_0 ternary inference still lives in PrismML's llama.cpp fork for several backends, so the gap looks as much like implementation maturity as model quality.
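For context on the two metrics being compared, here is a minimal timing sketch. It assumes an OpenAI-compatible streaming endpoint, which llama.cpp's llama-server exposes (whether the PrismML fork's server does is an assumption); the URL, model name, and prompt are placeholders, not values from the Reddit post. TTFT is the wait before the first streamed token; throughput is counted only after it arrives.

```python
# Minimal sketch: time TTFT and post-first-token throughput over one
# streamed request. Endpoint URL and model name are hypothetical.
import json
import time
import urllib.request

def measure(url: str, model: str, prompt: str, max_tokens: int = 256):
    """Return (TTFT seconds, steady-state tokens/sec) for one streamed request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    start = time.perf_counter()
    first = None
    n_tokens = 0
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # server-sent events, one "data: ..." line per chunk
            line = raw.decode().strip()
            if not line.startswith("data: ") or line.endswith("[DONE]"):
                continue
            delta = json.loads(line[6:])["choices"][0]["delta"].get("content")
            if delta:
                if first is None:
                    first = time.perf_counter()  # first visible token: TTFT
                n_tokens += 1  # rough: one SSE chunk ~ one token
    if first is None:
        raise RuntimeError("no tokens streamed back")
    end = time.perf_counter()
    return first - start, n_tokens / (end - first)

ttft, tps = measure("http://localhost:8080/v1/chat/completions",
                    "ternary-bonsai-8b",  # hypothetical model name
                    "Explain ternary quantization in two sentences.")
print(f"TTFT {ttft:.2f}s, {tps:.1f} t/s after first token")
```

Separating the two numbers matters here: a slow prefill path can make TTFT terrible even when steady-state tokens/sec looks acceptable, which is exactly the split the Windows run shows.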
// ANALYSIS
The hot take: this is a backend story more than a model story. PrismML's weights may be interesting, but the real differentiator right now is whether you are on the well-tuned MLX path or a still-maturing llama.cpp/CUDA stack.
- The Mac MLX numbers are genuinely strong for local inference, especially the 8B model at roughly 41 t/s on a 4K context.
- The Windows CPU-only results are weak enough that TTFT becomes the headline problem, not just steady-state tokens/sec.
- PrismML's own docs say ternary Q2_0 inference currently lives in its llama.cpp fork, with CUDA support tied to that path, so user results depend heavily on backend readiness.
- The comparison to the 1-bit Bonsai CPU build suggests the ternary format may be trading compatibility and kernel maturity for the compression headline (the ternarization sketch after this list shows what the format encodes).
- For developers evaluating these models, apples-to-apples backend comparisons matter more than raw model claims; the same model family can look excellent or disappointing depending on the runtime, as the backend sweep below illustrates.
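To make the "1.58-bit" label concrete: a generic BitNet-style absmean ternarization maps each weight to {-1, 0, +1} times a per-tensor scale, carrying log2(3) ≈ 1.58 bits of information per weight. This sketches the format class only; PrismML's actual Q2_0 packing and kernels are not described in the post.

```python
# Generic BitNet-b1.58-style ternarization: weights become {-1, 0, +1} times
# an absmean scale. Illustrative of the format class, not PrismML's Q2_0 kernel.
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a float weight tensor to ternary values with an absmean scale."""
    scale = np.abs(w).mean() + eps           # per-tensor absmean scale
    q = np.clip(np.round(w / scale), -1, 1)  # each weight becomes -1, 0, or +1
    return q.astype(np.int8), scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)
q, s = ternarize(w)
print(q)  # entries in {-1, 0, 1}
print(f"mean abs error: {np.abs(w - dequantize(q, s)).mean():.3f}")
# Q2_0-style storage typically packs each trit into 2 bits plus a scale,
# which is where the compression headline comes from.
```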
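And a hypothetical backend sweep for the last bullet: the same model, prompt, context, and token budget timed on each runtime, reusing measure() from the timing sketch above. Ports and names are placeholders; mlx_lm.server and llama-server both expose OpenAI-compatible endpoints, but serving support in the PrismML fork is assumed.

```python
# Hypothetical apples-to-apples sweep across locally served backends.
# Assumes measure() from the timing sketch above is in scope.
BACKENDS = {
    "mlx (mlx_lm.server)": "http://localhost:8080/v1/chat/completions",
    "llama.cpp fork":      "http://localhost:8081/v1/chat/completions",
}
PROMPT = "Write a 200-word summary of ternary quantization."

for name, url in BACKENDS.items():
    ttft, tps = measure(url, "ternary-bonsai-8b", PROMPT, max_tokens=256)
    print(f"{name:>22}  TTFT {ttft:6.2f}s   {tps:6.1f} t/s")
```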
// TAGS
llm · quantization · inference · benchmark · gpu · local-first · ternary-bonsai
DISCOVERED
3h ago
2026-05-04
PUBLISHED
6h ago
2026-05-04
RELEVANCE
9/10
AUTHOR
tony10000