Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3--12 s) and narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align best with perceptual and early language regions, while longer windows preferentially align with higher-order integrative regions, a gradient mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, narrative-task prompts (multi-scene summary, narrative summary, character motivation, and event boundary detection) elicit task-specific, region-dependent brain-alignment patterns and context-dependent shifts in clip-duration tuning in higher-order regions. Together, our results position long-form narrative movies as a principled testbed for probing biologically relevant temporal integration and interpretable representations in long-context MLLMs.