Recent advances in Multimodal Machine Learning and Artificial Intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Robotics. Whereas many approaches and previous surveys have characterised one or two of these dimensions, there has not been a holistic analysis at the intersection of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods than on illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly leverage computer vision and natural language for interaction in physical environments. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the current and new algorithmic approaches, metrics, simulators, and datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalisability and furthers real-world deployment.