Multi-modal language models (LMs) have recently shown promising performance in high-level reasoning tasks on videos. However, existing methods still fall short in tasks such as causal or compositional spatiotemporal reasoning over actions, in which model predictions need to be grounded in fine-grained low-level details, such as object motions and object interactions. In this work, we propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities. We show that a two-stream video encoder with spatiotemporal attention is effective at capturing the required static and motion-based cues in the video. By leveraging the LM's ability to perform the low-level surrogate tasks, we can cast reasoning in videos as the three-step process of Look, Remember, Reason, wherein visual information is extracted step-by-step using low-level visual skills and then integrated to arrive at a final answer. We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, and Something-Else datasets. Our approach is trainable end-to-end and surpasses state-of-the-art task-specific methods across these tasks by a large margin.