Llama 3.3 70B quantization hits multi-hop bottleneck
New benchmarks for quantized Llama 3.3 70B variants reveal a sharp performance cliff for Q4 models in multi-hop reasoning, despite strong single-hop retrieval. Developers should favor Q8 or Q6_K for tasks requiring logical assembly across non-adjacent context sections.
Llama 3.3 70B's high information density makes it uniquely fragile to aggressive quantization compared to its predecessors. Q4 variants consistently fail to integrate three or more pieces of information from different parts of the context, even when they retrieve the correct chunks. Standard benchmarks like MMLU fail to capture this "integration gap," which can mislead developers about real-world performance. The failure mode suggests quantization disproportionately damages the attention heads responsible for cross-section coherence. For complex RAG or agentic workflows, Q8 is now the mandatory floor for maintaining logical thread integrity. These benchmarks, run on llama.cpp with a 16k context, indicate that quantization is no longer "basically free" for dense 70B models.
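The "integration gap" described above can be probed directly: plant a few related facts in widely separated chunks of filler context and check whether the model's answer combines all of them, rather than just retrieving one. The sketch below is a minimal, model-agnostic harness for building such probes; all names (`build_context`, `score_answer`, the filler text) are illustrative assumptions, not part of any benchmark cited in the article, and the model call itself is left to the reader (e.g. via llama.cpp's server API).

```python
import random

# Illustrative filler sentence used to pad the context between planted facts.
FILLER = "The weather report for the region was unremarkable that day. "

def build_context(facts, filler_chunks=8, seed=0):
    """Spread each fact into a separate, non-adjacent chunk of filler text.

    A multi-hop probe: answering a question about the facts requires
    integrating information from several distant context sections.
    """
    if filler_chunks < 2 * len(facts) - 1:
        raise ValueError("not enough chunks to keep facts non-adjacent")
    rng = random.Random(seed)
    chunks = [FILLER * rng.randint(3, 6) for _ in range(filler_chunks)]
    # Deterministic, evenly spaced slots so no two facts land in
    # adjacent chunks (e.g. slots 0, 2, 4 for 3 facts in 8 chunks).
    step = filler_chunks // len(facts)
    for i, fact in enumerate(facts):
        chunks[i * step] += fact + " "
    return "".join(chunks)

def score_answer(answer, required_terms):
    """Crude integration score: fraction of required terms the answer uses.

    A single-hop answer mentions one term; a fully integrated answer
    mentions all of them, scoring 1.0.
    """
    answer = answer.lower()
    hits = sum(term.lower() in answer for term in required_terms)
    return hits / len(required_terms)
```

Running the same probe against Q8 and Q4 variants of the same model and comparing `score_answer` averages would surface the cliff the article describes, independent of single-hop retrieval accuracy.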
Discovered: 2026-03-26
Published: 2026-03-26
Author: bobupuhocalusof