Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and electroencephalography (EEG) signals across two datasets. Specifically, we employ eight similarity metrics, including Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals three key findings: (1) a rank-dependence split, in which model rankings vary substantially across similarity metrics; (2) spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced rise in RSA scores within the 250–500 ms time window, consistent with N400-related neural dynamics; and (3) an affective dissociation whereby negative prosody, identified with our proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings offer new neurobiological insights into the representational mechanisms of Audio LLMs.
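To make the central metric concrete, below is a minimal sketch of the standard Spearman-based RSA computation; the function name `rsa_spearman`, the use of correlation distance to build the representational dissimilarity matrices (RDMs), and the NumPy/SciPy toolchain are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_spearman(model_features, eeg_features):
    """Spearman-based RSA between model and EEG representations.

    model_features: (n_items, d_model) layer activations, one row per
        within-sentence unit (e.g., a word or time bin); illustrative.
    eeg_features:   (n_items, d_eeg) EEG responses for the same items,
        in the same order.
    Returns the Spearman rho between the two RDMs, compared over their
    upper triangles.
    """
    # First-order step: condensed RDMs (pairwise correlation distance
    # between items; pdist returns the upper triangle as a 1-D vector).
    rdm_model = pdist(model_features, metric="correlation")
    rdm_eeg = pdist(eeg_features, metric="correlation")
    # Second-order step: rank-correlate the two condensed RDMs.
    rho, _ = spearmanr(rdm_model, rdm_eeg)
    return rho

# Toy usage: 20 items, 512-d model layer vs. 64-d EEG features.
rng = np.random.default_rng(0)
print(rsa_spearman(rng.standard_normal((20, 512)),
                   rng.standard_normal((20, 64))))
```

Rank correlation over the RDM upper triangles is the conventional second-order comparison in RSA, since it is insensitive to monotonic distortions between the two dissimilarity scales.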