Visual Chain-of-Thought (VCoT) has emerged as a promising paradigm for enhancing multimodal reasoning by integrating visual perception into intermediate reasoning steps. However, existing VCoT approaches are largely confined to static scenarios and struggle to capture the temporal dynamics essential for tasks such as instruction execution, behavior prediction, and camera-motion understanding. To bridge this gap, we propose TwiFF-2.7M, the first large-scale, temporally grounded VCoT dataset, derived from $2.7$ million video clips and explicitly designed for dynamic visual question answering. Accompanying it, we introduce TwiFF-Bench, a high-quality evaluation benchmark of $1,078$ samples that assesses both the plausibility of reasoning trajectories and the correctness of final answers in open-ended dynamic settings. Building on these foundations, we present the TwiFF model, a unified model that synergistically leverages pre-trained video-generation and image-comprehension capabilities to produce temporally coherent visual reasoning cues, iteratively generating future action frames interleaved with textual reasoning. Extensive experiments demonstrate that TwiFF significantly outperforms existing VCoT methods and Textual Chain-of-Thought baselines on dynamic reasoning tasks, validating its effectiveness for visual question answering in dynamic scenarios. Our code and data are available at https://github.com/LiuJunhua02/TwiFF.
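As a concrete illustration of the interleaved generate-then-reason loop the abstract describes, the sketch below alternates between predicting a future action frame and producing a textual reasoning step until an answer is committed. All names here (`twiff_style_vcot`, `ReasoningState`, the `video_gen` and `reasoner` callables, and the stopping convention) are hypothetical and introduced only for illustration; they are not the model's actual interface.

```python
# Minimal sketch, assuming two black-box modules: a video generator that
# predicts the next frame, and a multimodal reasoner that emits a thought
# and, optionally, a final answer. Both are stand-ins, not the paper's API.
from dataclasses import dataclass, field


@dataclass
class ReasoningState:
    frames: list                                   # observed + generated frames
    thoughts: list = field(default_factory=list)   # textual reasoning steps
    answer: str | None = None


def twiff_style_vcot(question, clip_frames, video_gen, reasoner, max_steps=4):
    """Alternate visual and textual reasoning steps.

    video_gen(frames, thoughts) -> next predicted frame (visual cue)
    reasoner(question, frames, thoughts) -> (thought_text, answer_or_None)
    """
    state = ReasoningState(frames=list(clip_frames))
    for _ in range(max_steps):
        # 1) Visual step: extend the trajectory with a predicted future frame.
        state.frames.append(video_gen(state.frames, state.thoughts))
        # 2) Textual step: reason over observed + predicted frames.
        thought, answer = reasoner(question, state.frames, state.thoughts)
        state.thoughts.append(thought)
        if answer is not None:  # the reasoner commits to a final answer
            state.answer = answer
            break
    return state


if __name__ == "__main__":
    # Toy stubs so the sketch runs end to end.
    gen = lambda frames, thoughts: f"frame_{len(frames)}"
    rsn = lambda q, frames, thoughts: (
        f"thought {len(thoughts) + 1}",
        "final answer" if len(thoughts) >= 1 else None,
    )
    print(twiff_style_vcot("What happens next?", ["frame_0"], gen, rsn))
```

The key design point the abstract emphasizes is that the visual cue (the predicted frame) is produced before each textual step, so later reasoning is conditioned on an explicitly imagined future rather than on text alone.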