GLM-4.7-Flash Fumbles Plan-Mode Calls
A community benchmark suggests GLM-4.7-Flash is decent at raw tool calling but inconsistent about when to ask clarifying questions. It scores respectably overall, yet its plan-mode behavior looks brittle enough to trip up agentic coding workflows.
The score spread matters less than the behavior spread: this looks like a model that can use tools, but not one that reliably judges ambiguity.
- It posts 69% on `plan_mode` and 75% combined, well behind the top entries in the same comparison
- The reported failure mode cuts both ways: it under-clarifies when it should ask, and over-clarifies on trivial or already-answered inputs
- That pattern points to weak ambiguity calibration, which is more damaging for coding agents than a simple benchmark miss
- Its 90% `tool_calling` result says the plumbing is there; the real problem is decision quality around when to invoke tools
- For teams, this reads as a cheap workhorse for straightforward tool use, not a first-choice model for long-horizon agent loops
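The calibration failure the bullets describe can be made concrete: given eval cases labeled with whether a clarifying question was warranted, under- and over-clarification are just two cells of a confusion matrix. A minimal sketch, assuming hypothetical records; the field names and data are illustrative, not taken from the benchmark.

```python
# Hypothetical plan-mode eval records: each case notes whether a clarifying
# question was warranted and whether the model actually asked one.
cases = [
    {"needs_clarification": True,  "model_asked": True},   # correct ask
    {"needs_clarification": True,  "model_asked": False},  # under-clarified
    {"needs_clarification": False, "model_asked": True},   # over-clarified
    {"needs_clarification": False, "model_asked": False},  # correct silence
]

def clarification_rates(cases):
    """Return (under_rate, over_rate): missed questions among ambiguous
    cases, and spurious questions among unambiguous ones."""
    ambiguous = [c for c in cases if c["needs_clarification"]]
    clear = [c for c in cases if not c["needs_clarification"]]
    under = sum(not c["model_asked"] for c in ambiguous) / len(ambiguous)
    over = sum(c["model_asked"] for c in clear) / len(clear)
    return under, over

under, over = clarification_rates(cases)
print(f"under-clarification: {under:.0%}, over-clarification: {over:.0%}")
```

A model can score well on raw `tool_calling` while both of these rates stay high, which is the "plumbing works, judgment doesn't" split the benchmark reportedly shows.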
DISCOVERED: 2026-05-07
PUBLISHED: 2026-05-07
AUTHOR: No_Run8812