Qwen3.6 35B fails Stargate counting test
OPEN_SOURCE
REDDIT // 3h ago // NEWS

A Reddit user tested the reasoning capabilities of Alibaba's Qwen3.6 35B MoE model using a Stargate character count challenge, finding the model struggled to correctly identify the number of 'L's in "Jack O'Neill" (the correct answer is two). Despite high performance on technical benchmarks like SWE-bench, the model required three follow-up prompts to overcome tokenization-related errors and correct a spelling hallucination.

// ANALYSIS

Qwen3.6 demonstrates that "frontier" reasoning models still hit a wall with basic character-level awareness and linguistic groundedness.

  • The model hallucinated the spelling of "O'Neill" as "O'Neil" to justify a single-L count, showcasing a classic conflict between internal knowledge and token-based logic.
  • Sparse MoE architectures like the 35B-A3B variant optimize for inference speed but may sacrifice the granular precision needed for character-level reasoning tasks such as spelling and counting.
  • The gap between its 92.7% AIME score and its failure on a "vibe check" counting task highlights the limitations of current synthetic benchmarks in predicting real-world reliability.
  • While the model is highly corrigible through multi-turn dialogue, its zero-shot reasoning remains susceptible to the same pitfalls as previous LLM generations.
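The tokenization pitfall described above can be sketched in a few lines: a subword tokenizer hands the model multi-character chunks rather than individual letters, so counting letters forces it to recall spellings from memory instead of inspecting the string. The token split below is purely illustrative, not Qwen's actual vocabulary; the character-level ground truth, by contrast, is trivial to compute.

```python
# Character-level ground truth: "Jack O'Neill" contains two L's.
name = "Jack O'Neill"
l_count = name.upper().count("L")
print(l_count)  # → 2

# Hypothetical subword split (illustration only): the model sees
# chunks like these, never single letters, which is why it can
# "justify" a wrong count by hallucinating the spelling "O'Neil".
tokens = ["Jack", " O", "'", "Ne", "ill"]
assert "".join(tokens) == name
```

This is also why a multi-turn correction works: once the user spells the name out letter by letter, each letter becomes its own token and the model can count them directly.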
// TAGS
qwen3.6 · llm · reasoning · benchmark · alibaba · open-weights

DISCOVERED

3h ago

2026-04-28

PUBLISHED

4h ago

2026-04-28

RELEVANCE

8 / 10

AUTHOR

DashinTheFields