OPEN_SOURCE
REDDIT // 34d ago · BENCHMARK RESULT
Strix Halo makes 122B local inference usable
A LocalLLaMA benchmark run shows AMD's Ryzen AI Max+ 395 with Radeon 8060S graphics, 128GB unified memory, ROCm 7.2, and llama.cpp pushing far past the usual iGPU ceiling. The standout result is roughly 21 tok/s generation on a 122B Qwen3.5 MoE quant, alongside nearly 6,000 tokens/s prompt processing on Qwen3.5-0.8B Q4.
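For readers who want to sanity-check numbers like these on their own machine, here is a minimal throughput sketch using llama-cpp-python, a Python binding over the same llama.cpp engine. The model filename is hypothetical, and the sketch assumes the binding was built against a ROCm/HIP llama.cpp so GPU offload actually lands on the 8060S; the original post used llama.cpp directly, where llama-bench is the more precise tool.

# Rough tokens/sec check; timing includes prompt processing,
# so it slightly understates pure generation speed.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen3.5-122b-moe-q4_k_m.gguf",  # hypothetical path
    n_gpu_layers=-1,  # offload every layer; unified memory makes this viable
    n_ctx=4096,
)

t0 = time.perf_counter()
out = llm("Explain unified memory in one paragraph.", max_tokens=256)
dt = time.perf_counter() - t0

gen = out["usage"]["completion_tokens"]
print(f"{gen} tokens in {dt:.1f}s -> {gen / dt:.1f} tok/s")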
// ANALYSIS
This is the kind of benchmark that makes local AI hardware feel materially different, not just incrementally faster. Strix Halo is starting to look like a legitimate single-box inference platform for developers who want to run huge GGUF models without jumping to a discrete GPU rig.
- The real headline is not the tiny-model speed, but that a 122B-class quant is actually usable on integrated graphics with unified memory.
- MoE models look especially strong here, with Qwen3.5-35B-A3B and GLM-4.7-Flash posting numbers that make local experimentation much more practical; the bandwidth sketch after this list shows why.
- ROCm 7.2 appears to be a meaningful step for AMD's local LLM story, especially against the usual perception that CUDA is the only serious path.
- For AI developers, this points to a new sweet spot between laptop-class convenience and workstation-class local inference capacity.
- The benchmark is still a community Reddit post, not a vendor-run validation, but the results line up with the growing interest in Strix Halo as a local-LLM box.
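The MoE point above is mostly bandwidth arithmetic: on a memory-bound system, generation speed tracks the bytes of active weights streamed per token, not total model size. A back-of-envelope sketch follows; the bandwidth figure, quant footprint, efficiency factor, and active-parameter counts are illustrative assumptions, not numbers from the post.

# Why MoE quants punch above their weight on unified memory:
# each generated token streams only the *active* expert weights.
PEAK_BW_GBPS = 256.0    # assumed Strix Halo LPDDR5X peak bandwidth, GB/s
EFFICIENCY = 0.6        # assumed fraction of peak achieved in practice
BYTES_PER_PARAM = 0.55  # rough footprint of a Q4_K-style quant

def est_tok_per_s(active_params_b: float) -> float:
    """Estimate generation speed from active parameters (billions)."""
    bytes_per_token = active_params_b * 1e9 * BYTES_PER_PARAM
    return PEAK_BW_GBPS * 1e9 * EFFICIENCY / bytes_per_token

# Hypothetical active-parameter counts, for scaling intuition only.
for name, active_b in [("dense 122B", 122.0),
                       ("MoE, ~12B active", 12.0),
                       ("MoE, ~3B active (A3B-style)", 3.0)]:
    print(f"{name:30s} ~{est_tok_per_s(active_b):6.1f} tok/s")

Under these assumptions a dense 122B model would crawl at roughly 2 tok/s, while an MoE activating only a few billion parameters per token lands in the tens of tok/s or better, which is the regime the post reports.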
// TAGS
amd-ryzen-ai-max-plus-395 · llm · gpu · inference · benchmark
DISCOVERED
2026-03-09 (34d ago)
PUBLISHED
2026-03-09 (34d ago)
RELEVANCE
8/10
AUTHOR
przbadu