Large audio language models (LALMs) process both speech and environmental acoustic cues, yet they struggle to retain non-speech information across multi-turn interactions. The performance gap between semantic (speech) and acoustic (non-speech) understanding remains poorly understood, as do the underlying mechanisms of representation and retrieval. This work introduces EnvMem, a controlled multi-turn benchmark designed to study this gap and to identify the root causes of failure at the representation level (i.e., latent embeddings) and the retrieval level (i.e., attention allocation). We further conduct post-hoc interventions to probe representational structure and attention dynamics. Our results reveal representational trajectory drift as the key failure mode and show that attention allocation plays only a limited role in explaining the observed degradation. Overall, we provide a systematic framework for analyzing and improving non-linguistic memory in long-context LALMs, which can inform future data and training design for robust acoustic memory modeling.