Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite the growing role of these models in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how they route, utilize, and integrate audio and visual information across two input configurations: audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway previously established for VLMs and VideoLLMs, with audio and visual contributions flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information has been transferred to the LLM, with minimal impact on the model's prediction or even a slight improvement; this result generalizes across multiple tasks and datasets and enables more efficient inference. These findings hold across multiple models and scales (Qwen2.5-Omni and Video-SALMONN2 Plus, at 3B and 7B), and they lead to hypotheses about why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.