Task-Conditioned Probing Reveals Brain-Alignment Patterns in Instruction-Tuned Multimodal LLMs

Recent voxel-wise multimodal brain encoding studies have shown that multimodal large language models (MLLMs) align more closely with brain activity than unimodal models. Instruction-tuned (IT) multimodal models have further been shown to generate task-specific representations that align strongly with brain activity, yet most prior evaluations consider either unimodal stimuli or non-instruction-tuned models under multimodal stimuli. It therefore remains unclear whether instruction tuning is associated with IT-MLLMs organizing their representations around functional task demands, or whether those representations simply reflect surface semantics. To address this, we estimate brain alignment by predicting fMRI responses recorded during naturalistic movie watching (video with audio) from MLLM representations. Using instruction-specific embeddings from six video and two audio IT-MLLMs across 13 video task instructions, we find that instruction-tuned video MLLMs significantly outperform in-context learning (ICL) multimodal models (~9%), non-instruction-tuned multimodal models (~15%), and unimodal baselines (~20%). Our evaluation of MLLMs across video and audio tasks, combined with language-guided probing, yields distinct task-specific MLLM representations that vary across brain regions. We also find that ICL models show strong semantic organization (r=0.78), whereas IT models show only weak coupling to instruction-text semantics (r=0.14), consistent with task-conditioned subspaces associated with higher brain alignment. These findings are consistent with an association between task-specific instructions and stronger brain-MLLM alignment, and they open new avenues for mapping joint information processing in both systems. The code is publicly available [https://github.com/subbareddy248/mllm_videos].
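To make the notion of brain alignment concrete, the following is a minimal sketch of a generic voxel-wise encoding analysis: a ridge regression maps model embeddings to fMRI responses, and alignment is scored as the per-voxel Pearson correlation between predicted and held-out responses. This is an illustration under stated assumptions, not the pipeline used in this work: the abstract does not specify the regression model, regularization, cross-validation scheme, or any hemodynamic-delay handling, and the function names (`fit_ridge`, `voxelwise_pearson`, `brain_alignment`), the fixed `alpha`, the simple train/test split, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^{-1} X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between matching columns (one value per voxel)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Constant columns have zero norm; report a correlation of 0 for them.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, np.inf, denom)


def brain_alignment(features, fmri, n_train, alpha=1.0):
    """Fit on the first n_train time points, score on the remainder.

    features: (n_timepoints, n_features) model embeddings (assumed to be
              already aligned to the fMRI sampling times).
    fmri:     (n_timepoints, n_voxels) recorded responses.
    Returns an array of per-voxel Pearson correlations on held-out data.
    """
    X_train, X_test = features[:n_train], features[n_train:]
    Y_train, Y_test = fmri[:n_train], fmri[n_train:]

    # Standardize features and center responses using training statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-8
    y_mu = Y_train.mean(axis=0)

    W = fit_ridge((X_train - mu) / sd, Y_train - y_mu, alpha=alpha)
    Y_pred = ((X_test - mu) / sd) @ W + y_mu
    return voxelwise_pearson(Y_test, Y_pred)


# Synthetic stand-in data (assumption): 200 time points, 16 features, 50 voxels.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 16))
true_weights = rng.standard_normal((16, 50))
fmri = features @ true_weights + rng.standard_normal((200, 50))

scores = brain_alignment(features, fmri, n_train=150, alpha=1.0)
print("mean held-out correlation across voxels:", scores.mean())
```

In a setup like this, comparing model families amounts to running the same procedure on each model's embeddings and comparing the resulting per-voxel scores, either across the whole brain or within regions of interest.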
