LISTEN benchmark finds audio LLMs still read transcripts
The LISTEN paper introduces a controlled benchmark for separating lexical cues from acoustic emotion cues and reports that six state-of-the-art audio LLMs rely heavily on text transcripts. Across cue-conflict and paralinguistic settings, performance drops toward chance, suggesting current systems are far better at transcription than true acoustic understanding.
Audio LLM UX is outpacing audio LLM perception, and this paper is a needed reality check for anyone building speech-native agents.
- The benchmark isolates where models fail: neutral text plus emotional tone still leads to “neutral” predictions.
- Cue-conflict tests expose weak multimodal arbitration, which matters for sarcasm, stress detection, and real call-center audio.
- For developers, this implies transcript-first pipelines can hide core model weaknesses in production.
- It aligns with broader 2025-2026 audio-eval work showing many “audio” gains are actually language-model priors.
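To make the cue-conflict idea concrete, here is a minimal sketch of how one might score such a test: for each clip where the words and the tone imply different emotions, check which cue the model's prediction follows. This is not the paper's metric or code; the `CueConflictItem` type, its field names, and the `cue_following_rates` function are hypothetical names chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CueConflictItem:
    """One evaluated clip (hypothetical schema, not taken from the paper)."""
    lexical_label: str   # emotion implied by the words alone
    acoustic_label: str  # emotion implied by the tone of voice alone
    predicted: str       # emotion the model actually output


def cue_following_rates(items):
    """Return the fraction of cue-conflict clips where the prediction
    matches the lexical cue, the acoustic cue, or neither."""
    # Only clips whose two cues disagree can reveal which one the model follows.
    conflicts = [it for it in items if it.lexical_label != it.acoustic_label]
    if not conflicts:
        raise ValueError("no cue-conflict items to score")

    counts = Counter()
    for it in conflicts:
        if it.predicted == it.lexical_label:
            counts["lexical"] += 1
        elif it.predicted == it.acoustic_label:
            counts["acoustic"] += 1
        else:
            counts["other"] += 1

    total = len(conflicts)
    return {key: counts[key] / total for key in ("lexical", "acoustic", "other")}


# Made-up example data, for illustration only.
example = [
    CueConflictItem("neutral", "angry", "neutral"),
    CueConflictItem("happy", "sad", "happy"),
    CueConflictItem("neutral", "sad", "sad"),
    CueConflictItem("happy", "angry", "fearful"),
    CueConflictItem("sad", "sad", "sad"),  # cues agree, so it is ignored
]
rates = cue_following_rates(example)
```

A lexical rate well above the acoustic rate on clips like these would be one sign of the transcript reliance described above; a transcript-first pipeline never surfaces that gap because the tone is discarded before the model sees the input.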
DISCOVERED: 2026-03-03
PUBLISHED: 2026-02-25