Mistral Small 4 posts strong single-GPU benchmark
OPEN_SOURCE
REDDIT // 25d ago // BENCHMARK RESULT


A LocalLLaMA benchmark on one RTX Pro 6000 (SGLang, no prompt caching, no speculative decoding) shows strong single-user decode for Mistral-Small-4-119B-2603-NVFP4, with 131.3 tok/s at 1K context and 64.2 tok/s at 256K. The main bottleneck is TTFT at higher context and concurrency, which climbs from sub-second at short prompts to 66.8s at 256K.
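The reported numbers translate into end-to-end latency with a simple model: total response time is TTFT plus output tokens divided by decode speed. A back-of-envelope sketch, using the benchmark's figures (the ~0.5 s short-prompt TTFT and 500-token response are assumed for illustration):

```python
# Rough end-to-end latency model for a single uncached request.
# tok/s and 66.8 s TTFT are from the benchmark; the additive model,
# 0.5 s short-prompt TTFT, and 500-token output are assumptions.

def response_time(ttft_s: float, output_tokens: int, decode_tok_s: float) -> float:
    """Total wall-clock time = time to first token + decode time."""
    return ttft_s + output_tokens / decode_tok_s

# 1K context: sub-second TTFT (~0.5 s assumed), 131.3 tok/s decode
print(f"1K ctx:   {response_time(0.5, 500, 131.3):.1f} s")   # ~4.3 s
# 256K context: 66.8 s TTFT, 64.2 tok/s decode
print(f"256K ctx: {response_time(66.8, 500, 64.2):.1f} s")   # ~74.6 s
```

The gap makes the post's point concrete: at long context, prefill dominates the wait even though decode stays usable.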

// ANALYSIS

This is a solid real-world datapoint for teams considering a single-card Mistral Small 4 deployment: throughput is usable, but responsiveness depends heavily on prompt-caching strategy.

  • The benchmark confirms worst-case uncached behavior, so production chat/coding workflows with incremental context should see materially better TTFT.
  • At 32K context, capacity lands around 3 concurrent requests under conservative UX thresholds, which is practical for small internal copilots.
  • At 64K-96K, TTFT becomes the binding constraint before decode speed, so queueing and admission control matter more than raw tok/s.
  • The reported vLLM-vs-SGLang tradeoff (better TTFT vs slower decode) suggests stack tuning is now as important as model choice on Blackwell-class cards.
  • For long-context agent workloads, KV-cache optimization and speculative decoding support will likely determine whether this setup scales beyond single-user “power mode.”
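The admission-control point above can be sketched as a gate that estimates uncached TTFT from prompt length and rejects or queues requests that would exceed a UX budget. Everything here beyond the two measured TTFT points is an assumption (the linear interpolation, the 0.8 s short-prompt figure, and the 10 s budget):

```python
# Hypothetical admission gate: estimate prefill time from context length
# and only admit requests whose estimated TTFT fits the latency budget.
# Linear interpolation between the two measured points is an assumption;
# 0.8 s stands in for the benchmark's "sub-second" short-prompt TTFT.

MEASURED = [(1_000, 0.8), (256_000, 66.8)]  # (context tokens, TTFT seconds)

def estimate_ttft(ctx_tokens: int) -> float:
    (x0, y0), (x1, y1) = MEASURED
    frac = (ctx_tokens - x0) / (x1 - x0)
    return y0 + frac * (y1 - y0)

def admit(ctx_tokens: int, ttft_budget_s: float = 10.0) -> bool:
    """Admit only if the estimated uncached TTFT fits the UX budget."""
    return estimate_ttft(ctx_tokens) <= ttft_budget_s

print(admit(32_000))   # fits the budget at 32K context
print(admit(96_000))   # would be queued or rejected at 96K
```

Under these assumptions a 32K prompt clears the gate while a 96K prompt does not, matching the observation that TTFT, not decode speed, becomes binding in the 64K-96K range.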
// TAGS
mistral-small-4 · llm · inference · gpu · benchmark · open-weights · self-hosted

DISCOVERED

2026-03-17

PUBLISHED

2026-03-17

RELEVANCE

9/10

AUTHOR

jnmi235