Language-conditioned robot manipulation is an emerging field that aims to enable seamless communication and cooperation between humans and robotic agents by teaching robots to comprehend and execute instructions given in natural language. This interdisciplinary area integrates scene understanding, language processing, and policy learning to bridge the gap between human instructions and robot actions. In this survey, we systematically review recent advances in language-conditioned robot manipulation. We categorize existing methods by the primary way language is integrated into the robot system: language for state evaluation, language as a policy condition, language for cognitive planning and reasoning, and language in unified vision-language-action models. We further analyze state-of-the-art techniques along five axes: action granularity, data and supervision regimes, system cost and latency, environments and evaluations, and cross-modal task specification. We also highlight the key debates in the field. Finally, we discuss open challenges and future research directions, focusing on improving generalization and addressing safety in language-conditioned robot manipulators.