Mistral Small 4 posts strong single-GPU benchmark
OPEN_SOURCE
REDDIT // 25d ago // BENCHMARK RESULT


A LocalLLaMA benchmark on one RTX Pro 6000 (SGLang, no prompt caching, no speculative decoding) shows strong single-user decode for Mistral-Small-4-119B-2603-NVFP4, with 131.3 tok/s at 1K context and 64.2 tok/s at 256K. The main bottleneck is TTFT at higher context and concurrency, which climbs from sub-second at short prompts to 66.8s at 256K.
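The reported numbers translate into end-to-end latency with a simple model: total response time is TTFT plus output tokens divided by decode speed. A back-of-envelope sketch, using the benchmark's figures (the ~0.5 s short-prompt TTFT and 500-token response are assumed for illustration):

```python
# Rough end-to-end latency model for a single uncached request.
# tok/s and 66.8 s TTFT are from the benchmark; the additive model,
# 0.5 s short-prompt TTFT, and 500-token output are assumptions.

def response_time(ttft_s: float, output_tokens: int, decode_tok_s: float) -> float:
    """Total wall-clock time = time to first token + decode time."""
    return ttft_s + output_tokens / decode_tok_s

# 1K context: sub-second TTFT (~0.5 s assumed), 131.3 tok/s decode
print(f"1K ctx:   {response_time(0.5, 500, 131.3):.1f} s")   # ~4.3 s
# 256K context: 66.8 s TTFT, 64.2 tok/s decode
print(f"256K ctx: {response_time(66.8, 500, 64.2):.1f} s")   # ~74.6 s
```

The gap makes the post's point concrete: at long context, prefill dominates the wait even though decode stays usable.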

// ANALYSIS

This is a solid real-world datapoint for teams considering a single-card Mistral Small 4 deployment: throughput is usable, but responsiveness depends heavily on prompt-caching strategy.

  • The benchmark confirms worst-case uncached behavior, so production chat/coding workflows with incremental context should see materially better TTFT.
  • At 32K context, capacity lands around 3 concurrent requests under conservative UX thresholds, which is practical for small internal copilots.
  • At 64K-96K, TTFT becomes the binding constraint before decode speed, so queueing and admission control matter more than raw tok/s.
  • The reported vLLM-vs-SGLang tradeoff (better TTFT vs slower decode) suggests stack tuning is now as important as model choice on Blackwell-class cards.
  • For long-context agent workloads, KV-cache optimization and speculative decoding support will likely determine whether this setup scales beyond single-user “power mode.”
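The admission-control point above can be sketched as a gate that estimates uncached TTFT from prompt length and rejects or queues requests that would exceed a UX budget. Everything here beyond the two measured TTFT points is an assumption (the linear interpolation, the 0.8 s short-prompt figure, and the 10 s budget):

```python
# Hypothetical admission gate: estimate prefill time from context length
# and only admit requests whose estimated TTFT fits the latency budget.
# Linear interpolation between the two measured points is an assumption;
# 0.8 s stands in for the benchmark's "sub-second" short-prompt TTFT.

MEASURED = [(1_000, 0.8), (256_000, 66.8)]  # (context tokens, TTFT seconds)

def estimate_ttft(ctx_tokens: int) -> float:
    (x0, y0), (x1, y1) = MEASURED
    frac = (ctx_tokens - x0) / (x1 - x0)
    return y0 + frac * (y1 - y0)

def admit(ctx_tokens: int, ttft_budget_s: float = 10.0) -> bool:
    """Admit only if the estimated uncached TTFT fits the UX budget."""
    return estimate_ttft(ctx_tokens) <= ttft_budget_s

print(admit(32_000))   # fits the budget at 32K context
print(admit(96_000))   # would be queued or rejected at 96K
```

Under these assumptions a 32K prompt clears the gate while a 96K prompt does not, matching the observation that TTFT, not decode speed, becomes binding in the 64K-96K range.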
// TAGS
mistral-small-4 · llm · inference · gpu · benchmark · open-weights · self-hosted

DISCOVERED

2026-03-17

PUBLISHED

2026-03-17

RELEVANCE

9/10

AUTHOR

jnmi235