Nick Lothian SQL benchmark crowns Qwen 122B
REDDIT · 10d ago · BENCHMARK RESULT

Nick Lothian’s agentic text-to-SQL benchmark found that large Qwen and Nemotron variants still dominate on a consumer RTX 5080, especially when VRAM is supplemented with RAM offload. The standout surprise is a small Qwen3.5 9B Claude-4.6 high-IQ finetune, which jumps from 5 to 16 green tests by fixing tool-call formatting.

// ANALYSIS

The big takeaway is that tool-calling quality now matters almost as much as raw model size for SQL agents, and a well-tuned small model can close a lot of the gap. But this is still a narrow benchmark: it’s a strong read on single-shot agentic SQL, not a proxy for broader codebase reasoning.

  • Qwen3.5-122B-A10B is the clear heavyweight winner here, with RAM offload making it usable on 16GB VRAM cards if you can tolerate slower inference
  • Qwen3.5-9B Claude-4.6 HighIQ is the practical surprise: most of its earlier failures came from malformed tool calls, so the finetune is doing real work, not just posturing
  • Nemotron-Cascade-2-30B-A3B looks unusually competitive for its size and deserves attention as a self-hostable sweet spot
  • The benchmark is deliberately short and agentic, so models that excel at isolated SQL generation can score well here without necessarily generalizing to longer multi-step coding tasks
  • For local LLM users, this reinforces the tradeoff triangle: bigger models still win on quality, but quantization, offload, and tool-call reliability decide what is actually usable
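
To make the tool-call point concrete, here is a minimal sketch of the kind of check that separates a well-formed call from a malformed one. The `run_sql` tool name, its `query` argument, and the JSON shape are assumptions for illustration only; the post does not describe the benchmark's actual harness or schema.

```python
import json

# Hypothetical tool schema: the real benchmark's tool names and arguments
# are not given in the post, so this mapping is an illustrative assumption.
REQUIRED_ARGS = {"run_sql": {"query"}}


def parse_tool_call(raw: str):
    """Return (name, args) for a well-formed call, or None if it is malformed."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(call, dict):
        return None  # valid JSON, but not an object
    name = call.get("name")
    args = call.get("arguments")
    if not isinstance(name, str) or name not in REQUIRED_ARGS:
        return None  # missing or unknown tool name
    if not isinstance(args, dict):
        return None  # arguments must be an object, not a bare string
    if not REQUIRED_ARGS[name].issubset(args):
        return None  # a required argument is missing
    return name, args
```

Under this sketch, a model that emits correct SQL inside a badly shaped call still fails the step, which is the kind of failure a formatting-focused finetune could plausibly fix.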
// TAGS
benchmark · llm · agent · sql · qwen · self-hosted · gpu · local-llm

DISCOVERED: 10d ago (2026-04-01)

PUBLISHED: 10d ago (2026-04-01)

RELEVANCE: 8/10

AUTHOR: grumd