REDDIT // 3h ago · BENCHMARK RESULT

PrismML Ternary Bonsai shows split speeds

Reddit benchmarks paint a mixed picture for PrismML's 1.58-bit Ternary Bonsai models: Mac MLX performance looks strong, while a Windows Ryzen 5700G CPU-only run with the llama.cpp fork is much slower, with a painfully long time to first token (TTFT). The official repo notes that Q2_0 ternary inference still lives in the PrismML fork for several backends, so the gap looks as much like implementation maturity as model quality.

// ANALYSIS

The hot take: this is a backend story more than a model story. PrismML's weights may be interesting, but the real differentiator right now is whether you are on the well-tuned MLX path or a still-maturing llama.cpp/CUDA stack.
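
For context on the 1.58-bit label: ternary weights take one of three values, and log2(3) ≈ 1.58 bits per weight. Below is a minimal NumPy sketch of absmean ternarization in the BitNet b1.58 style; the post doesn't describe PrismML's actual quantizer, so treat the function as illustrative, not as PrismML's method.

```python
import numpy as np

def ternary_quantize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Absmean ternarization in the BitNet b1.58 style: scale by the mean
    absolute weight, then round each weight to the nearest of {-1, 0, +1}."""
    scale = float(np.abs(w).mean()) + 1e-8   # avoid divide-by-zero on all-zero tensors
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, s = ternary_quantize(w)
assert set(np.unique(q)) <= {-1, 0, 1}       # three states ≈ log2(3) = 1.58 bits/weight
err = np.abs(w - dequantize(q, s)).mean()
print(f"scale={s:.4f}  mean abs reconstruction error={err:.4f}")
```

The catch, which the bullets below get at: storing three-state weights is the easy half. Fast inference needs backend kernels that unpack and multiply them efficiently, and that is exactly what the fork-only Q2_0 path does or doesn't provide on a given platform.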

  • The Mac MLX numbers are genuinely strong for local inference, especially the 8B model at roughly 41 t/s on a 4K context.
  • The Windows CPU-only results are weak enough that TTFT becomes the headline problem, not just steady-state tokens/sec (see the timing sketch after this list).
  • PrismML's own docs say ternary Q2_0 inference is currently in its llama.cpp fork, with CUDA support tied to that path, so user results depend heavily on backend readiness.
  • The comparison to the 1-bit Bonsai CPU build suggests the ternary format may be trading compatibility and kernel maturity for the compression headline.
  • For developers evaluating these models, apples-to-apples backend comparisons matter more than raw model claims; the same model family can look excellent or disappointing depending on the runtime.
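
One way to make that comparison concrete is to time TTFT and steady-state throughput the same way against every backend. A minimal sketch, assuming each runtime is fronted by an OpenAI-compatible streaming endpoint (llama.cpp's llama-server exposes one out of the box; the MLX side would need an equivalent server). The URLs, the model field, and the chunks-as-tokens approximation are all assumptions, not details from the post.

```python
import json
import time
import requests

# Hypothetical endpoints: point these at wherever each backend actually serves.
BACKENDS = {
    "prismml-llama.cpp-fork": "http://localhost:8080/v1/chat/completions",
    "mlx-server": "http://localhost:8081/v1/chat/completions",
}

PROMPT = "Summarize the trade-offs of ternary quantization in two sentences."

def measure(url: str, prompt: str) -> tuple[float, float]:
    """Return (ttft_seconds, steady_state_tokens_per_sec) for one streamed request."""
    payload = {
        "model": "local",  # single-model servers typically ignore this field
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
        "stream": True,
    }
    start = time.perf_counter()
    first = None
    n_chunks = 0
    with requests.post(url, json=payload, stream=True, timeout=300) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"]
            if delta.get("content"):          # skip role-only first chunk
                n_chunks += 1
                if first is None:
                    first = time.perf_counter()  # first visible token -> TTFT
    end = time.perf_counter()
    ttft = (first or end) - start
    # Throughput is measured after the first token, so slow prompt processing
    # (the TTFT pain point in the Windows runs) doesn't pollute the number.
    tps = (n_chunks - 1) / (end - first) if n_chunks > 1 else 0.0
    return ttft, tps

for name, url in BACKENDS.items():
    ttft, tps = measure(url, PROMPT)
    print(f"{name:24s} TTFT {ttft:6.2f}s   steady-state {tps:5.1f} t/s")
```

Keeping the prompt, max_tokens, and measurement method identical is the apples-to-apples part; only the backend varies, so any split in the two numbers is attributable to the runtime rather than the model.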
// TAGS
llm · quantization · inference · benchmark · gpu · local-first · ternary-bonsai

DISCOVERED
3h ago · 2026-05-04

PUBLISHED
6h ago · 2026-05-04

RELEVANCE
9/10

AUTHOR
tony10000