Multimodal MoE models fail visual reasoning via routing divergence
Researchers from Zhejiang University and Alibaba Group report that multimodal Mixture-of-Experts (MoE) models suffer from catastrophic routing divergence in their middle layers. The paper demonstrates that although these models perceive images correctly, perceptual signals preemptively hijack the experts responsible for reasoning, which causes reasoning failures.
This paper highlights a fundamental architectural flaw in current multimodal MoE designs: perception overrides cognition instead of collaborating with it.
- Routing divergence occurs in middle layers, preventing deeper cognitive processing
- Models are "Seeing but Not Thinking" because perceptual signals hijack cognitive experts early
- Findings suggest MoE architectures need explicit separation or staging between perception and reasoning layers
- A critical read for AI researchers building next-gen multimodal foundation models
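This summary does not say how the paper measures routing divergence, so the following is only a minimal sketch of one way such a quantity could be computed, not the authors' method. It assumes that each token's top-k expert choices have been recorded per layer for two groups of tokens (for example, perceptual versus reasoning tokens), and it compares the groups' expert-usage distributions with the Jensen-Shannon divergence. The function names, the toy routing data, and the choice of metric are all hypothetical.

```python
import math
from collections import Counter


def expert_distribution(assignments, num_experts):
    """Fraction of routing slots sent to each expert.

    `assignments` is a list with one tuple of expert ids per token
    (the token's top-k routing choices).
    """
    counts = Counter(e for token in assignments for e in token)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no routing assignments recorded")
    return [counts.get(e, 0) / total for e in range(num_experts)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def layerwise_divergence(routes_a, routes_b, num_experts):
    """One divergence score per layer between two token groups."""
    return [
        js_divergence(
            expert_distribution(a, num_experts),
            expert_distribution(b, num_experts),
        )
        for a, b in zip(routes_a, routes_b)
    ]


# Hypothetical toy data: 3 layers, 4 experts, top-2 routing, 2 tokens per group.
routes_a = [
    [(0, 1), (0, 1)],  # layer 0
    [(0, 1), (0, 1)],  # layer 1 (middle)
    [(0, 1), (1, 2)],  # layer 2
]
routes_b = [
    [(0, 1), (1, 0)],  # layer 0: same experts as group A
    [(2, 3), (2, 3)],  # layer 1: completely different experts
    [(0, 1), (0, 1)],  # layer 2: partial overlap
]

divergences = layerwise_divergence(routes_a, routes_b, num_experts=4)
peak_layer = max(range(len(divergences)), key=divergences.__getitem__)
print(divergences, peak_layer)
```

In this toy setup the score peaks at the middle layer, which is the kind of pattern the summary describes. Real routing traces would have to come from the model's router outputs, which this sketch does not cover.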
DISCOVERED
2026-04-12
PUBLISHED
2026-04-12
AUTHOR
Discover AI