OPEN_SOURCE
REDDIT · 10d ago · BENCHMARK RESULT
Nick Lothian SQL benchmark crowns Qwen 122B
Nick Lothian’s agentic text-to-SQL benchmark found that large Qwen and Nemotron variants still dominate on a consumer RTX 5080, especially when VRAM is supplemented with RAM offload. The standout surprise is a small Qwen3.5 9B Claude-4.6 high-IQ finetune, which jumps from 5 to 16 green tests by fixing tool-call formatting.
// ANALYSIS
The big takeaway is that tool-calling quality now matters almost as much as raw model size for SQL agents, and a well-tuned small model can close a lot of the gap. But this is still a narrow benchmark: it’s a strong read on single-shot agentic SQL, not a proxy for broader codebase reasoning.
- Qwen3.5-122B-A10B is the clear heavyweight winner here, with RAM offload making it usable on 16GB VRAM cards if you can tolerate slower inference
- Qwen3.5-9B Claude-4.6 HighIQ is the practical surprise: most of its earlier failures came from malformed tool calls, so the finetune is doing real work, not just posturing
- Nemotron-Cascade-2-30B-A3B looks unusually competitive for its size and deserves attention as a self-hostable sweet spot
- The benchmark is deliberately short and agentic, so models that are good at isolated SQL generation can shine even if they may not generalize to longer multi-step coding tasks
- For local LLM users, this reinforces the tradeoff triangle: bigger models still win on quality, but quantization, offload, and tool-call reliability decide what is actually usable
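To make the offload tradeoff concrete, here is a rough back-of-envelope sketch of how much of a 122B-parameter model would spill into system RAM on a 16GB card. The 4-bit quantization level is an assumption (the summary does not say which quantization the benchmark used), the helper name `estimate_offload` is hypothetical, and the estimate counts weights only, ignoring KV cache, activations, and runtime overhead.

```python
def estimate_offload(params_billion: float, bits_per_weight: float, vram_gb: float):
    """Return (weights_gb, offload_gb) for a rough weights-only estimate."""
    # 1 billion parameters at 8 bits each is about 1 GB, so scale by bits / 8.
    weights_gb = params_billion * bits_per_weight / 8
    # Whatever does not fit in VRAM has to live in system RAM.
    offload_gb = max(0.0, weights_gb - vram_gb)
    return weights_gb, offload_gb


# 122B parameters, ASSUMED 4-bit quantization, 16 GB of VRAM.
weights_gb, offload_gb = estimate_offload(122, 4, 16)
print(f"weights ~{weights_gb:.0f} GB, offloaded to RAM ~{offload_gb:.0f} GB")
```

Under these assumptions most of the weights sit in system RAM rather than on the GPU, which is consistent with the slower inference noted above. A 9B model at the same assumed quantization would fit entirely in VRAM.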
// TAGS
benchmark · llm · agent · sql · qwen · self-hosted · gpu · local-llm
DISCOVERED
10d ago
2026-04-01
PUBLISHED
10d ago
2026-04-01
RELEVANCE
8/10
AUTHOR
grumd