OPEN_SOURCE
REDDIT // 5h ago
BENCHMARK RESULT

RTX 5090 dominates local AI benchmarks

New benchmarks of the sparse MoE model Qwen3.6-35B-A3B show the NVIDIA RTX 5090 sustaining a record 220+ tokens per second in llama.cpp. While GDDR7 bandwidth gives NVIDIA a large lead in raw generation speed, the Mac M5 Max remains the "context king" for developers who need its 128GB unified memory pool for repository-level reasoning.

// ANALYSIS

The RTX 5090’s GDDR7 bandwidth finally makes sparse MoE models feel like local "instant" intelligence, but Apple’s memory architecture still wins on utility for deep codebase reasoning.

  • The 5090 delivers a ~30% generation speed increase over the 4090, peaking at 240 t/s during long-context generation.
  • Qwen3.6-35B-A3B activates only 3B parameters per token, allowing the aging RTX 3090 to still deliver a respectable 140 t/s.
  • Mac M5 Max is restricted by memory bandwidth for raw speed (~95 t/s) but can natively host 1M token context windows that would require 4+ RTX 3090s to fit in VRAM.
  • These results suggest that for developer agentic workflows, the 5090 is the new gold standard for latency, while high-RAM Macs remain the standard for large-scale repo analysis.
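The bandwidth gap behind these numbers can be sketched with a standard back-of-envelope: decode speed for a memory-bound model is roughly memory bandwidth divided by the bytes read per generated token (active parameters times bytes per weight). The bandwidth figures and the quantization size below are illustrative assumptions, not values from the benchmark itself.

```python
# Back-of-envelope decode-speed ceiling for a memory-bound MoE model:
#   t/s <= memory_bandwidth / (active_params * bytes_per_weight)
# All hardware figures here are assumptions for illustration.

ACTIVE_PARAMS = 3e9       # Qwen3.6-35B-A3B activates ~3B parameters per token
BYTES_PER_WEIGHT = 0.56   # ~4.5 bits/weight, typical of a Q4_K_M-style quant

def decode_ceiling(bandwidth_gb_s: float) -> float:
    """Theoretical upper bound on tokens/second, ignoring compute and overhead."""
    return bandwidth_gb_s * 1e9 / (ACTIVE_PARAMS * BYTES_PER_WEIGHT)

# Assumed peak memory bandwidths (GB/s) -- check vendor specs for real values.
hardware = {
    "RTX 5090 (GDDR7)": 1792,
    "RTX 3090 (GDDR6X)": 936,
    "M5 Max (unified)": 546,
}

for name, bw in hardware.items():
    print(f"{name}: <= {decode_ceiling(bw):.0f} t/s theoretical ceiling")
```

Measured throughput lands well below this ceiling because each token also reads the KV cache and pays attention-compute and kernel-launch overhead, but the ratio between the ceilings tracks the ratio between the benchmarked speeds.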
// TAGS
llm · gpu · benchmark · open-weights · qwen-3-6 · rtx-5090 · mac-m5-max · llama-cpp

DISCOVERED

5h ago

2026-04-20

PUBLISHED

6h ago

2026-04-19

RELEVANCE

9/10

AUTHOR

chain-77