We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, which aligns unimodal and multimodal queries with the current task plan and responds accurately; and (2) plan-based retrieval, which retrieves relevant plan steps in either textual or visual form. We conduct experiments on a novel dataset of rich instructional video dialogues aligned with cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan-guidance setting, reaching over 90\% accuracy on plan-aware VQA.