We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It is designed for a task in which embodied AI agents reason about and forecast the future actions of humans, given video clips of their initial activities and their intents expressed in text. The dataset consists of 2.9k videos from Charades extended with intents via crowdsourcing and a multiple-choice question test set, and is accompanied by four strong baselines. One of the baselines implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the others are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.