OPEN_SOURCE
LOBSTERS // 40d ago · NEWS
LISTEN benchmark finds audio LLMs still read transcripts
The LISTEN paper introduces a controlled benchmark for separating lexical cues from acoustic emotion cues and reports that six state-of-the-art audio LLMs rely heavily on text transcripts. Across cue-conflict and paralinguistic settings, performance drops toward chance, suggesting current systems are far better at transcription than true acoustic understanding.
// ANALYSIS
Audio LLM UX is outpacing audio LLM perception, and this paper is a needed reality check for anyone building speech-native agents.
- The benchmark isolates where models fail: a lexically neutral transcript delivered in an emotional tone still yields a “neutral” prediction.
- Cue-conflict tests expose weak multimodal arbitration, which matters for sarcasm, stress detection, and real call-center audio (a minimal sketch of this setup follows the list).
- For developers, this implies transcript-first pipelines can mask core model weaknesses in production.
- It aligns with broader 2025-2026 audio-eval work showing that many “audio” gains are actually language-model priors.
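A minimal sketch of the cue-conflict idea, not the LISTEN benchmark's actual harness: the items, file paths, and the `predict_emotion` wrapper below are hypothetical stand-ins for whatever audio LLM you are evaluating. The point is to pair a lexically neutral transcript with emotionally charged audio and count whether predictions track the acoustic label or the lexical one.

```python
# Hypothetical cue-conflict check (illustrative only; not from the LISTEN paper).
# `predict_emotion(audio_path, transcript)` is an assumed wrapper around an audio LLM
# that returns a single emotion string such as "neutral", "angry", or "sad".

from collections import Counter

# Each item pairs a lexically neutral transcript with audio delivered in a
# non-neutral tone; the two labels record what each cue alone would imply.
cue_conflict_items = [
    {"audio": "clips/meeting_0.wav", "transcript": "The meeting is at three.",
     "lexical_label": "neutral", "acoustic_label": "angry"},
    {"audio": "clips/report_1.wav", "transcript": "I sent the report yesterday.",
     "lexical_label": "neutral", "acoustic_label": "sad"},
]

def arbitration_report(predict_emotion):
    """Tally how often predictions follow the acoustic cue versus the lexical cue."""
    tallies = Counter()
    for item in cue_conflict_items:
        pred = predict_emotion(item["audio"], item["transcript"])
        if pred == item["acoustic_label"]:
            tallies["followed_audio"] += 1
        elif pred == item["lexical_label"]:
            tallies["followed_text"] += 1
        else:
            tallies["other"] += 1
    return tallies

# A model that mostly reads the transcript piles up in "followed_text",
# which is the failure mode the paper's cue-conflict setting probes.
```

On a transcript-reading system the report skews almost entirely toward the lexical label, which is consistent with the near-chance paralinguistic performance the paper describes.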
// TAGS
listen · llm · speech · multimodal · benchmark · research
DISCOVERED: 40d ago (2026-03-03)
PUBLISHED: 45d ago (2026-02-25)
RELEVANCE: 8/10