REDDIT · BENCHMARK RESULT

Strix Halo makes 122B local inference usable

A LocalLLaMA benchmark run shows AMD's Ryzen AI Max+ 395 (Radeon 8060S graphics, 128 GB unified memory) running ROCm 7.2 and llama.cpp pushing far past the usual iGPU ceiling. The standout results are roughly 21 tok/s generation on a 122B Qwen3.5 MoE quant and nearly 6,000 tok/s prompt processing on Qwen3.5-0.8B at Q4.
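
For anyone who wants to sanity-check numbers like these on their own hardware, here is a minimal sketch that drives llama.cpp's llama-bench tool from Python. It assumes a ROCm build of llama.cpp is already present on the machine; the GGUF filenames are hypothetical placeholders, not the exact quants from the post.

```python
# Minimal sketch: run llama-bench against the two scenarios from the post.
# Assumes a ROCm (HIP) build of llama.cpp; model filenames are hypothetical.
import subprocess

RUNS = [
    # (model file, prompt tokens, generation tokens)
    ("Qwen3.5-122B-MoE-Q4_K_M.gguf", 512, 128),  # ~21 tok/s generation claim
    ("Qwen3.5-0.8B-Q4_K_M.gguf", 2048, 0),       # ~6,000 tok/s prompt-processing claim
]

for model, n_prompt, n_gen in RUNS:
    subprocess.run(
        [
            "./llama-bench",
            "-m", model,          # GGUF quant to benchmark
            "-p", str(n_prompt),  # prompt-processing test size
            "-n", str(n_gen),     # token-generation test size
            "-ngl", "99",         # offload all layers to the 8060S iGPU
        ],
        check=True,
    )
```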

// ANALYSIS

This is the kind of benchmark that makes local AI hardware feel materially different, not just incrementally faster. Strix Halo is starting to look like a legitimate single-box inference platform for developers who want to run huge GGUF models without jumping to a discrete GPU rig.

  • The real headline is not the tiny-model speed but that a 122B-class quant is actually usable on integrated graphics with unified memory (see the back-of-the-envelope sketch after this list).
  • MoE models look especially strong here. Because only a few billion parameters are active per token (the A3B in Qwen3.5-35B-A3B denotes roughly 3B activated parameters), Qwen3.5-35B-A3B and GLM-4.7-Flash post numbers that make local experimentation much more practical.
  • ROCm 7.2 appears to be a meaningful step for AMD's local-LLM story, chipping away at the usual perception that CUDA is the only serious path.
  • For AI developers, this points to a new sweet spot between laptop-class convenience and workstation-class local inference capacity.
  • The benchmark is still a community Reddit post, not a vendor-run validation, but the results line up with the growing interest in Strix Halo as a local-LLM box.
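
A quick back-of-the-envelope check, sketched below, makes the headline numbers plausible. All three inputs are assumptions rather than figures from the post: roughly 0.56 bytes per parameter for a Q4_K_M-class quant, ~256 GB/s of unified-memory bandwidth for Strix Halo's 256-bit LPDDR5X-8000 bus, and about 10B active parameters per token for the 122B MoE.

```python
# Back-of-the-envelope sketch: does a 122B MoE quant fit in unified memory,
# and what is the bandwidth-bound generation ceiling? All constants below
# are assumptions, not numbers from the post.
Q4_BYTES_PER_PARAM = 0.56  # rough average for Q4_K_M-class quants
BANDWIDTH_GB_S = 256       # assumed Strix Halo unified-memory bandwidth

def weight_gb(params_billion: float) -> float:
    """Approximate GGUF weight size in GB at a Q4-class quant."""
    return params_billion * Q4_BYTES_PER_PARAM

def gen_ceiling_tok_s(active_params_billion: float) -> float:
    """Each generated token must stream the active weights from memory once."""
    return BANDWIDTH_GB_S / weight_gb(active_params_billion)

print(f"122B MoE weights: ~{weight_gb(122):.0f} GB")                # ~68 GB
print(f"ceiling @ 10B active: ~{gen_ceiling_tok_s(10):.0f} tok/s")  # ~46 tok/s
```

On those assumptions the Q4 weights come to roughly 68 GB, which fits comfortably in 128 GB of unified memory, and the reported ~21 tok/s sits at a believable fraction of the ~46 tok/s bandwidth ceiling.
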
// TAGS
amd-ryzen-ai-max-plus-395 · llm · gpu · inference · benchmark

DISCOVERED

2026-03-09

PUBLISHED

2026-03-09

RELEVANCE

8/10

AUTHOR

przbadu