Video reasoning requires understanding the causal relationships between events in a video. However, such relationships are often implicit and costly to annotate manually. Existing multimodal large language models (MLLMs) typically infer event relations through dense captions or video summaries, but this modeling still lacks causal understanding. Without explicit modeling of causal structure within and across video events, these models are prone to hallucinations during video reasoning. In this work, we propose GraphThinker, a reinforcement-finetuning-based method that constructs structured event-level scene graphs and enhances visual grounding to jointly reduce hallucinations in video reasoning. Specifically, we first employ an MLLM to construct an event-based video scene graph (EVSG) that explicitly models both intra- and inter-event relations, and we incorporate the resulting scene graphs into the MLLM as an intermediate thinking process. We also introduce a visual attention reward during reinforcement finetuning, which strengthens video grounding and further mitigates hallucinations. We evaluate GraphThinker on two datasets, RexTime and VidHalluc, where it captures object and event relations more accurately and localizes events more precisely, reducing hallucinations in video reasoning compared to prior methods.