Action knowledge involves understanding the textual, visual, and temporal aspects of actions. We introduce the Action Dynamics Benchmark (ActionBench), which contains two carefully designed probing tasks: Action Antonym and Video Reversal, which target a model's multimodal alignment capabilities and temporal understanding skills, respectively. Despite the impressive performance of recent video-language models (VidLMs) on various benchmark tasks, our diagnostic tasks reveal a surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition as a shortcut for action understanding. To remedy this, we propose a novel framework, Paxion, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. The Paxion framework uses a Knowledge Patcher network to encode new action knowledge and a Knowledge Fuser component to integrate the Patcher into frozen VidLMs without compromising their existing capabilities. Because the widely used Video-Text Contrastive (VTC) loss is limited in its ability to learn action knowledge, we introduce the DVDM objective to train the Knowledge Patcher. DVDM forces the model to encode the correlation between the action text and the correct ordering of video frames. Our extensive analyses show that Paxion and DVDM together effectively fill the gap in action knowledge understanding (from ~50% to 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks.
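As a rough illustration of the two probing tasks, the sketch below builds one contrast pair for each and scores a model by how often it prefers the correct candidate. It is a minimal sketch under stated assumptions, not the benchmark's actual construction: the antonym table, the function names, the integer stand-ins for frames, and the pairwise scoring rule are all hypothetical and are not specified in the abstract.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical antonym table; the benchmark's real mapping is not given in the text.
ANTONYMS: Dict[str, str] = {"open": "close", "pick up": "put down", "rise": "fall"}

Frame = int  # stand-in for an image; a real frame would be an array or tensor


def action_antonym_pair(caption: str, action: str) -> Tuple[str, str]:
    """Return (original caption, caption with the action swapped for its antonym)."""
    return caption, caption.replace(action, ANTONYMS[action], 1)


def video_reversal_pair(frames: Sequence[Frame]) -> Tuple[List[Frame], List[Frame]]:
    """Return (frames in original order, frames in reversed order)."""
    return list(frames), list(reversed(frames))


def probe_accuracy(scores: Sequence[Tuple[float, float]]) -> float:
    """Fraction of pairs where the correct candidate scores strictly higher.

    Each item is (score_for_correct, score_for_distractor). Under this assumed
    pairwise setup, a model that cannot tell the two apart lands near 0.5.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for pos, neg in scores if pos > neg) / len(scores)
```

In this setup, the scores passed to `probe_accuracy` would come from a video-language model's video-text similarity; a model that attends only to the objects in the scene would give both candidates in a pair nearly the same score, which is consistent with the near-random performance the abstract reports.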