FOMOE runs Qwen3.5-397B on desktop
REDDIT // OPEN-SOURCE RELEASE


FOMOE is an open-source, from-scratch C/HIP MoE inference stack for AMD consumer GPUs that uses VRAM, DRAM, and NVMe caching plus cache-aware routing to make Qwen3.5-397B practical on a $2,100 dual-GPU desktop. The repo claims 5.1 tok/s at baseline and 8.8 tok/s with Cache-Aware Routing (CAR), at the cost of a 3.5% WikiText perplexity penalty.
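The core idea is a tiered expert cache: serve an expert from VRAM if it is resident, fall back to DRAM, and only touch NVMe on a full miss, promoting the expert afterward so later tokens find it hot. A minimal sketch in C, assuming a per-expert residency table (the names, tiers, and promotion policy here are illustrative, not FOMOE's actual API):

```c
#include <assert.h>

typedef enum { TIER_VRAM, TIER_DRAM, TIER_NVME } tier_t;

/* Hypothetical per-expert residency record; the real stack would track
 * live device buffers and handle eviction, which is elided here. */
typedef struct { tier_t tier; } expert_slot;

/* Return the tier the expert was served from, then promote it to VRAM.
 * The caller pays the cost implied by the source tier: VRAM is free,
 * DRAM costs a PCIe copy, NVMe costs a disk read plus the copy. */
tier_t fetch_expert(expert_slot *slots, int id) {
    tier_t where = slots[id].tier;
    if (where != TIER_VRAM)
        slots[id].tier = TIER_VRAM;  /* promote on miss (no eviction modeled) */
    return where;
}
```

Under this model, throughput is dominated by how often routing lands on experts already in the top tier, which is exactly what CAR then optimizes.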

// ANALYSIS

This is less a model breakthrough than a storage-hierarchy hack: FOMOE turns a giant MoE into a caching problem, then wrings speed out of whichever experts are already hot.

  • Dual-GPU ping-pong is a smart way to turn two 16 GB cards into a bigger effective expert cache without heavy cross-GPU coordination, which is exactly the kind of ugly-but-effective trick local inference lives on.
  • Cache-Aware Routing is the novel lever: it substitutes a cached expert when the score gap is small, cutting NVMe reads dramatically and lifting throughput to the reported ~8.8 tok/s.
  • The tradeoff is real. The README itself calls CAR experimental, and a 3.5% WikiText PPL hit may be fine for tinkering but still needs broader validation on coding, long-context, and instruction-following workloads.
  • The bigger implication is that MoE deployment is becoming a memory and I/O optimization game, not just a FLOPS game. That makes local inference more possible, but also more fragile than the headline speed number suggests.
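The CAR mechanism described above can be sketched as a small routing function: take the top-scoring expert, but if it is not cached and some cached expert scores within a small gap of it, substitute the cached one and skip the NVMe read. Everything below (the threshold value, the residency bitmap, the function name) is an illustrative assumption, not FOMOE's implementation:

```c
#include <stdbool.h>

#define N_EXPERTS 8
#define GAP_THRESHOLD 0.05f  /* hypothetical: max router-score gap for substitution */

/* Hypothetical residency bitmap; in FOMOE this would query the VRAM/DRAM cache. */
static bool cached[N_EXPERTS] = { true, false, true, false, true, false, false, true };

/* Cache-aware routing sketch: prefer the top-scoring expert; if it is not
 * cached, fall back to the best cached expert whose score is within
 * GAP_THRESHOLD of the top score, avoiding an NVMe load. */
int route_expert(const float scores[N_EXPERTS]) {
    int best = 0;
    for (int i = 1; i < N_EXPERTS; i++)
        if (scores[i] > scores[best]) best = i;
    if (cached[best]) return best;            /* fast path: already hot */

    int sub = -1;
    for (int i = 0; i < N_EXPERTS; i++) {
        if (!cached[i] || i == best) continue;
        if (scores[best] - scores[i] <= GAP_THRESHOLD &&
            (sub < 0 || scores[i] > scores[sub]))
            sub = i;                          /* best cached near-tie */
    }
    return (sub >= 0) ? sub : best;           /* no near-tie: pay the NVMe load */
}
```

The perplexity penalty falls directly out of this substitution: the token is sometimes processed by a slightly lower-scoring expert, which is why the hit grows with the threshold.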
// TAGS
fomoe · llm · inference · gpu · open-source · benchmark

DISCOVERED

2026-03-24

PUBLISHED

2026-03-23

RELEVANCE

8/10

AUTHOR

Rare-Tadpole-8841