Qwen3.6 35B fails Stargate counting test
OPEN_SOURCE
REDDIT // 3h ago // NEWS

A Reddit user tested the reasoning capabilities of Alibaba's Qwen3.6 35B MoE model using a Stargate character count challenge, finding the model struggled to correctly identify the number of 'L's in "Jack O'Neill" (the correct answer is two). Despite high performance on technical benchmarks like SWE-bench, the model required three follow-up prompts to overcome tokenization-related errors and correct a spelling hallucination.

// ANALYSIS

Qwen3.6 demonstrates that "frontier" reasoning models still hit a wall with basic character-level awareness and linguistic groundedness.

  • The model hallucinated the spelling of "O'Neill" as "O'Neil" to justify a single-L count, showcasing a classic conflict between internal knowledge and token-based logic.
  • Sparse MoE architectures like the 35B-A3B variant optimize for inference speed but may sacrifice the granular precision needed for character-level reasoning tasks such as spelling and counting.
  • The gap between its 92.7% AIME score and its failure on a "vibe check" counting task highlights the limitations of current synthetic benchmarks in predicting real-world reliability.
  • While the model is highly corrigible through multi-turn dialogue, its zero-shot reasoning remains susceptible to the same pitfalls as previous LLM generations.
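The tokenization pitfall described above can be sketched in a few lines: a subword tokenizer hands the model multi-character chunks rather than individual letters, so counting letters forces it to recall spellings from memory instead of inspecting the string. The token split below is purely illustrative, not Qwen's actual vocabulary; the character-level ground truth, by contrast, is trivial to compute.

```python
# Character-level ground truth: "Jack O'Neill" contains two L's.
name = "Jack O'Neill"
l_count = name.upper().count("L")
print(l_count)  # → 2

# Hypothetical subword split (illustration only): the model sees
# chunks like these, never single letters, which is why it can
# "justify" a wrong count by hallucinating the spelling "O'Neil".
tokens = ["Jack", " O", "'", "Ne", "ill"]
assert "".join(tokens) == name
```

This is also why a multi-turn correction works: once the user spells the name out letter by letter, each letter becomes its own token and the model can count them directly.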
// TAGS
qwen3.6 · llm · reasoning · benchmark · alibaba · open-weights

DISCOVERED

3h ago

2026-04-28

PUBLISHED

4h ago

2026-04-28

RELEVANCE

8 / 10

AUTHOR

DashinTheFields