Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models

We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real world. Although considerable effort has gone into anticipating future actions, prior approaches do not account for the plausibility of an action sequence. To address this limitation, we explore the generative capability of a large video-language model and develop its understanding of plausibility by introducing two objective functions: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We use temporal logical constraints and verb-noun action pair logical constraints to create implausible (counterfactual) action sequences, and we train the model on them with the plausible action sequence learning loss. This loss helps the model differentiate between plausible and implausible action sequences and learn the implicit temporal cues that are crucial for action anticipation. The long-horizon action repetition loss places a higher penalty on actions that are more prone to repetition over a longer temporal window. With this penalty, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation.
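
The two ideas in the abstract can be illustrated with a toy sketch. It is not the paper's implementation: the abstract gives no formulas, so the function names, the `must_precede` constraint set, and the count-based weighting below are all assumptions chosen for illustration. The first function builds a counterfactual sequence by violating a temporal ordering constraint; the second assigns larger weights to actions that have already recurred within the window, which is one simple way a repetition penalty could scale.

```python
from collections import Counter
from typing import List, Optional, Set, Tuple

Action = Tuple[str, str]  # hypothetical (verb, noun) representation


def temporal_counterfactual(
    seq: List[Action],
    must_precede: Set[Tuple[Action, Action]],
) -> Optional[List[Action]]:
    """Return a copy of seq with one ordering constraint violated.

    must_precede holds (earlier, later) pairs; this is an assumed encoding
    of a temporal logical constraint, not the paper's own.
    Returns None when no constraint applies to the sequence.
    """
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            if (seq[i], seq[j]) in must_precede:
                swapped = list(seq)
                # Swapping the pair makes the sequence implausible.
                swapped[i], swapped[j] = swapped[j], swapped[i]
                return swapped
    return None


def repetition_weights(seq: List[Action]) -> List[float]:
    """Weight each step by 1 plus the number of earlier occurrences.

    An assumed stand-in for a long-horizon repetition penalty: the more
    often an action has already appeared, the larger its weight.
    """
    seen: Counter = Counter()
    weights = []
    for action in seq:
        weights.append(1.0 + seen[action])
        seen[action] += 1
    return weights


# Toy example in the spirit of the title: an egg must be taken before it is cracked.
plausible = [("take", "egg"), ("crack", "egg"), ("whisk", "egg")]
constraints = {(("take", "egg"), ("crack", "egg"))}
implausible = temporal_counterfactual(plausible, constraints)
```

In a training setup, pairs such as `plausible` and `implausible` would serve as positive and negative examples, and per-step weights like those from `repetition_weights` would scale a per-token loss; how PlausiVL actually does either is not specified in the abstract.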

