APEX benchmark shows prompt position drives compliance
A LocalLLaMA post shares APEX benchmark results across Gemma 3 (4B, 12B) and Qwen3 32B variants, testing how token position in an 8,192-token window affects behavior. The data shows factual recall stays strong across positions, while instruction following drops in the middle and salience integration appears mainly in larger models.
Prompt engineering is still architecture-aware systems design, not just wording tweaks.
- The U-shaped compliance curve reinforces “lost in the middle” as a practical production issue, not a niche benchmark artifact.
- Flat factual recall means teams should optimize prompt layout for control and behavior, not basic memory.
- Near-zero salience integration on smaller models suggests some capabilities are missing, not merely weaker.
- If replicated at 72B, this could influence RAG chunk ordering, system prompt placement, and agent planning templates.
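One way to act on the RAG chunk ordering point is to keep the strongest retrieved chunks away from the middle of the context. The sketch below is an illustration only, not something the post or APEX prescribes: `order_for_edges` is a hypothetical helper, and the relevance scores are assumed to come from whatever retriever is already in use.

```python
from typing import List, Tuple


def order_for_edges(scored_chunks: List[Tuple[str, float]]) -> List[str]:
    """Place the highest-scoring chunks at the start and end of the context.

    Hypothetical helper: chunks are ranked by score, then dealt alternately
    to the front and the back, so the weakest chunks land in the middle.
    """
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    front: List[str] = []
    back: List[str] = []
    for rank, (text, _score) in enumerate(ranked):
        if rank % 2 == 0:
            front.append(text)  # ranks 0, 2, 4, ... fill the start
        else:
            back.append(text)   # ranks 1, 3, 5, ... fill the end
    # Reverse the back half so the second-best chunk is the very last one.
    return front + back[::-1]


if __name__ == "__main__":
    # Made-up chunks and scores, purely for illustration.
    chunks = [("a", 0.9), ("b", 0.8), ("c", 0.5), ("d", 0.2), ("e", 0.1)]
    print(order_for_edges(chunks))
```

Whether this layout helps on a given model is an empirical question; the reported U-shaped curve concerns instruction compliance, so retrieval quality should be measured separately before adopting it.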
DISCOVERED: 2026-03-05
PUBLISHED: 2026-03-05
AUTHOR: Double-Risk-1945