OPEN_SOURCE
REDDIT // BENCHMARK RESULT
"RTX 5090 dominates local AI benchmarks" follows headlinese.
New benchmarks for the sparse MoE model Qwen3.6-35B-A3B reveal that the NVIDIA RTX 5090 achieves a record-breaking 220+ tokens per second using llama.cpp. While NVIDIA's GDDR7 bandwidth provides a massive leap in raw generation speed, the Mac M5 Max remains the "context king" for developers needing massive 128GB unified memory pools for repository-level reasoning.
// ANALYSIS
The RTX 5090’s GDDR7 bandwidth finally makes sparse MoE models feel like local "instant" intelligence, but Apple’s memory architecture still wins on utility for deep codebase reasoning.
- The 5090 delivers a ~30% generation speed increase over the 4090, peaking at 240 t/s during long-context generation.
- Qwen3.6-35B-A3B activates only 3B parameters per token, allowing the aging RTX 3090 to still deliver a respectable 140 t/s.
- The Mac M5 Max is limited by memory bandwidth on raw speed (~95 t/s) but can natively host 1M-token context windows that would take 4+ RTX 3090s to fit in VRAM.
- These results suggest that for agentic developer workflows, the 5090 is the new gold standard for latency, while high-RAM Macs remain the standard for large-scale repo analysis.
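Why bandwidth dominates here: single-stream token generation mostly streams the model's active weights from memory once per token, so memory bandwidth divided by bytes read per token gives a rough speed ceiling. The sketch below illustrates that roofline; the bandwidth figures and Q4-class quantization are assumptions for illustration, not official specs, and real llama.cpp numbers land well below the ceiling due to routing, KV-cache reads, and kernel overhead.

```python
# Back-of-envelope decode-speed roofline for a sparse MoE model.
# Assumption: generation is memory-bandwidth-bound, so an upper bound
# is bandwidth / bytes-read-per-token (active expert weights only).

ACTIVE_PARAMS = 3e9      # ~3B activated parameters per token (the "A3B")
BITS_PER_WEIGHT = 4.5    # assumed ~Q4-class quantization

bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8  # ~1.7 GB/token

def roofline_tps(bandwidth_gbs: float) -> float:
    """Theoretical max tokens/s if every byte of active weights is
    streamed once per generated token at full memory bandwidth."""
    return bandwidth_gbs * 1e9 / bytes_per_token

# Bandwidths below are illustrative assumptions, not vendor specs.
for name, bw_gbs in [
    ("GDDR7 card, ~1790 GB/s", 1790),
    ("GDDR6X card, ~936 GB/s", 936),
    ("Unified memory, ~546 GB/s", 546),
]:
    print(f"{name}: <= {roofline_tps(bw_gbs):.0f} t/s ceiling")
```

Even this crude model reproduces the ordering in the benchmarks: the higher-bandwidth discrete GPU has several times the headroom of a unified-memory machine, which is why the latter's advantage shows up in capacity (context size) rather than raw t/s.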
// TAGS
llm · gpu · benchmark · open-weights · qwen-3-6 · rtx-5090 · mac-m5-max · llama-cpp
DISCOVERED
5h ago
2026-04-20
PUBLISHED
6h ago
2026-04-19
RELEVANCE
9/10
AUTHOR
chain-77