Despite tremendous advances in AI, it remains a significant challenge to develop interactive task guidance systems that can offer situated, personalized guidance and assist humans in various tasks. These systems need a sophisticated understanding of both the user and the environment, and must make timely, accurate decisions about when to speak and what to say. To address this challenge, we created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG), based on natural interaction between a human user and a human instructor. We further proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We leveraged several foundation models to study the extent to which these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can demonstrate fair performance in some cases with no task-specific training, but fast and reliable adaptation remains a significant challenge. Our benchmark and baselines will provide a stepping stone for future work on situated task guidance.