Despite significant progress in video question answering (VideoQA), existing methods fall short on questions that require causal or temporal reasoning across frames. This can be attributed to imprecise motion representations. We introduce Action Temporality Modeling (ATM) for temporal reasoning, which is distinguished by three components: (1) revisiting optical flow and finding that it is effective at capturing long-horizon temporal information; (2) training the visual-text embedding with contrastive learning in an action-centric manner, which yields better action representations in both the vision and text modalities; and (3) preventing the model from answering the question when given a shuffled video during fine-tuning, which avoids spurious correlations between appearance and motion and hence ensures faithful temporal reasoning. In experiments, ATM outperforms previous approaches in accuracy on multiple VideoQA benchmarks and exhibits better true temporal reasoning ability.
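Components (2) and (3) can be illustrated with a minimal sketch. The abstract does not state the exact losses, so everything below is an assumption made for illustration: an InfoNCE-style contrastive term stands in for the action-centric objective, and a penalty of the form -log(1 - p) stands in for discouraging correct answers on shuffled video. The function names (`shuffle_frames`, `info_nce`, `shuffled_video_penalty`) and the temperature value are hypothetical, not taken from ATM.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def shuffle_frames(frames, rng):
    """Return the frames in a random order that differs from the original.

    Hypothetical helper: permutes frame positions so temporal order is destroyed
    while per-frame appearance is kept.
    """
    if len(frames) < 2:
        return list(frames)
    order = list(range(len(frames)))
    while order == sorted(order):  # reshuffle until the order actually changes
        rng.shuffle(order)
    return [frames[i] for i in order]


def info_nce(similarities, positive_idx, temperature=0.07):
    """InfoNCE-style loss for one video against candidate action texts.

    Assumption: `similarities` holds video-to-text similarity scores and
    `positive_idx` marks the text describing the video's action.
    """
    probs = softmax([s / temperature for s in similarities])
    return -math.log(probs[positive_idx])


def shuffled_video_penalty(answer_logits, correct_idx, eps=1e-8):
    """Penalty that grows as the model stays confident on a shuffled video.

    Assumption: -log(1 - p_correct) is one plausible way to discourage the
    model from answering correctly once temporal order is removed.
    """
    p_correct = softmax(answer_logits)[correct_idx]
    return -math.log(1.0 - p_correct + eps)


if __name__ == "__main__":
    rng = random.Random(0)
    print(shuffle_frames(["f0", "f1", "f2", "f3"], rng))
    print(info_nce([0.9, 0.1, 0.0], positive_idx=0))
    print(shuffled_video_penalty([5.0, 0.0, 0.0], correct_idx=0))
```

In this sketch the penalty is small when the model is unsure about a shuffled clip and large when it still picks the correct answer, which is the behavior component (3) describes at a high level.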