Audio Large Language Models (Audio LLMs) have demonstrated strong capabilities in integrating speech perception with language understanding. However, whether their internal representations align with human neural dynamics during naturalistic listening remains largely unexplored. In this work, we systematically examine layer-wise representational alignment between 12 open-source Audio LLMs and electroencephalography (EEG) signals across two datasets. Specifically, we employ eight similarity metrics, such as Spearman-based Representational Similarity Analysis (RSA), to characterize within-sentence representational geometry. Our analysis reveals three key findings: (1) we observe a rank-dependence split, in which model rankings vary substantially across different similarity metrics; (2) we identify spatio-temporal alignment patterns characterized by depth-dependent alignment peaks and a pronounced increase in RSA within the 250-500 ms time window, consistent with N400-related neural dynamics; (3) we find an affective dissociation whereby negative prosody, identified using a proposed Tri-modal Neighborhood Consistency (TNC) criterion, reduces geometric similarity while enhancing covariance-based dependence. These findings provide new neurobiological insights into the representational mechanisms of Audio LLMs.
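Spearman-based RSA, one of the similarity metrics named above, can be sketched as follows. This is a minimal illustrative implementation, not the paper's actual pipeline; the function name, array shapes, and choice of correlation distance for the dissimilarity matrices are assumptions for the sake of the example.

```python
# A minimal sketch of Spearman-based RSA (Representational Similarity
# Analysis). Shapes and names here are illustrative assumptions.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def spearman_rsa(model_feats: np.ndarray, eeg_feats: np.ndarray) -> float:
    """Correlate the representational geometries of two feature spaces.

    model_feats: (n_items, d_model) hidden states for n_items stimuli.
    eeg_feats:   (n_items, d_eeg) EEG features for the same stimuli.
    """
    # Representational dissimilarity matrices (condensed upper triangles):
    # pairwise correlation distance between items within each space.
    rdm_model = pdist(model_feats, metric="correlation")
    rdm_eeg = pdist(eeg_feats, metric="correlation")
    # Spearman rank correlation between the two RDMs is the RSA score,
    # measuring how similar the two representational geometries are.
    rho, _ = spearmanr(rdm_model, rdm_eeg)
    return float(rho)
```

Because RSA compares only the rank ordering of pairwise dissimilarities, it is invariant to the differing dimensionalities of model hidden states and EEG features, which is what makes layer-wise model-to-brain comparison possible.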