Audio provides critical situational cues, yet current Audio Language Models (ALMs) face an attention bottleneck in long-form recordings, where dominant background patterns can dilute rare, salient events. We introduce NAACA, a training-free NeuroAuditory Attentive Cognitive Architecture that reframes attention allocation as an auditory salience filtering problem. At its core is OWM, a neuro-inspired Oscillatory Working Memory that maintains stable attractor-like states and triggers higher-level ALM reasoning only when adaptive energy fluctuations signal perceptual salience. On XD-Violence, NAACA improves AudioQwen's average precision (AP) from 53.50% to 70.60% while reducing unnecessary ALM invocations. Furthermore, qualitative case studies on the Urban Soundscapes of the World (USoW) dataset show that OWM captures novel events and subcategory shifts while remaining robust to transient pauses and ambient urban noise.
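
To make the gating idea concrete, the following is a minimal sketch of salience-triggered invocation. It is not the OWM mechanism itself, which the abstract does not specify. It assumes a per-frame scalar energy signal and swaps in a simple running mean and variance as the adaptive baseline; the function name `salience_gate` and the parameters `alpha`, `k`, and `warmup` are hypothetical.

```python
import math


def salience_gate(energies, alpha=0.1, k=3.0, warmup=5):
    """Return indices of frames whose energy departs from a running baseline.

    Assumption: this is an illustrative stand-in for a salience filter,
    not the paper's oscillatory working memory. An exponential moving
    mean and variance track the background; a frame is flagged when it
    deviates from the mean by more than k running standard deviations.
    """
    mean = energies[0]
    var = 0.0
    flagged = []
    for i, e in enumerate(energies[1:], start=1):
        diff = e - mean
        std = math.sqrt(var)
        if i >= warmup and abs(diff) > k * std + 1e-8:
            # Salient frame: this is where an expensive ALM call would go.
            flagged.append(i)
        else:
            # Only non-salient frames update the background statistics.
            mean += alpha * diff
            var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged


# A steady background with one abrupt event at index 7.
energies = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0, 1.0, 1.02]
print(salience_gate(energies))
```

In this sketch only the flagged frames would be forwarded to the language model, which is the sense in which gating reduces unnecessary invocations. Because flagged frames never update the baseline, a lasting change in the background would keep triggering; a fuller design would need some way to absorb such shifts.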