We introduce ViLPAct, a novel vision-language benchmark for human activity planning. It targets a task in which embodied AI agents reason about and forecast humans' future actions, given video clips of their initial activities and their intents expressed in text. The benchmark consists of 2.9k videos from Charades extended with intents via crowdsourcing, a multi-choice question test set, and four strong baselines. One baseline implements a neurosymbolic approach based on a multi-modal knowledge base (MKB), while the others are deep generative models adapted from recent state-of-the-art (SOTA) methods. According to our extensive experiments, the key challenges are compositional generalization and effective use of information from both modalities.