We introduce PlausiVL, a large video-language model for anticipating action sequences that are plausible in the real world. While significant effort has gone into anticipating future actions, prior approaches do not account for the plausibility of an action sequence. To address this limitation, we explore the generative capability of a large video-language model and further develop its understanding of plausibility in an action sequence by introducing two objective functions: a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We use temporal logical constraints as well as verb-noun action pair logical constraints to create implausible (counterfactual) action sequences, and use them to train the model with the plausible action sequence learning loss. This loss helps the model differentiate between plausible and implausible action sequences and learn the implicit temporal cues crucial for action anticipation. The long-horizon action repetition loss places a higher penalty on actions that are more prone to repetition over a longer temporal window. With this penalization, the model is able to generate diverse, plausible action sequences. We evaluate our approach on two large-scale datasets, Ego4D and EPIC-Kitchens-100, and show improvements on the task of action anticipation.
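The two ideas in the abstract, counterfactual sequences built by violating temporal logical constraints and a penalty that grows as an action repeats over a window, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the constraint table `PRECEDES`, the helper names, and the counting-based penalty below are all assumptions made for illustration. The penalty here is a plain, non-differentiable score rather than a training loss, and verb-noun pair constraints are not shown.

```python
# Toy sketch only. PRECEDES, the function names, and the scoring rule are
# hypothetical; they are not taken from the PlausiVL paper.

# Each action is a (verb, noun) pair. A constraint (earlier, later) means
# `earlier` should occur before `later` whenever both appear in a sequence.
PRECEDES = {
    (("open", "fridge"), ("close", "fridge")),
    (("take", "knife"), ("cut", "onion")),
}


def violates_temporal(seq, precedes=PRECEDES):
    """Return True if some constrained pair first appears in the wrong order."""
    first_pos = {}
    for i, action in enumerate(seq):
        first_pos.setdefault(action, i)  # remember the first occurrence only
    return any(
        a in first_pos and b in first_pos and first_pos[a] > first_pos[b]
        for a, b in precedes
    )


def make_counterfactual(seq, precedes=PRECEDES):
    """Swap one correctly ordered constrained pair to get an implausible sequence.

    Returns a new list, or None if no constraint applies to `seq`.
    """
    for a, b in sorted(precedes):  # sorted for deterministic behaviour
        if a in seq and b in seq:
            i, j = seq.index(a), seq.index(b)
            if i < j:
                out = list(seq)
                out[i], out[j] = out[j], out[i]
                return out
    return None


def repetition_penalty(seq, window=None):
    """Count, for each position, earlier occurrences of the same action.

    Only the preceding `window` steps are inspected (all of them if None),
    so an action repeated many times is penalized more than one repeated once.
    """
    penalty = 0
    for i, action in enumerate(seq):
        start = 0 if window is None else max(0, i - window)
        penalty += seq[start:i].count(action)
    return penalty


plausible = [
    ("take", "knife"), ("cut", "onion"),
    ("open", "fridge"), ("close", "fridge"),
]
counterfactual = make_counterfactual(plausible)
```

In this sketch a plausible sequence and its counterfactual differ only in the order of one constrained pair, which is the kind of contrast a model would need implicit temporal cues to tell apart. A longer `window` lets the penalty see repeats that are further apart, loosely mirroring the long-horizon aspect described above.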