Given a natural language instruction, a general-purpose robot must comprehend the instruction and find the target object or location from visual observations, even in unexplored environments. Most agents rely on massive, diverse training data to generalize better, and collecting such data requires expensive labor. These agents often focus on common objects and a small number of tasks, and are therefore not capable of handling different types of instructions. To facilitate research in open-set vision-and-language navigation, we propose a benchmark named MO-VLN, which aims to test the effectiveness and generalization of agents in a multi-task setting. First, we develop a 3D simulator with realistically rendered scenes using Unreal Engine 5, featuring more realistic lighting and details. The simulator contains three scenes of high industrial value: a cafe, a restaurant, and a nursing home. In addition, our simulator includes many uncommon objects, such as takeaway cups and medical adhesive tape, which make it more complex than existing environments. Inspired by the recent success of large language models (e.g., ChatGPT, Vicuna), we construct diverse, high-quality instruction data without human annotation. Our benchmark MO-VLN provides four tasks: 1) goal-conditioned navigation given a specific object category (e.g., "fork"); 2) goal-conditioned navigation given simple instructions (e.g., "Search for and move towards a tennis ball"); 3) step-by-step instruction following; 4) finding an abstract object based on a high-level instruction (e.g., "I am thirsty").