We present a novel approach to automatically synthesizing "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches, which rely heavily on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions from just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment, which the LLM then uses for instruction synthesis. We implement our approach on multiple simulation platforms, including Matterport3D, AI Habitat and ThreeDWorld, thereby demonstrating its platform-agnostic nature. We evaluate our approach subjectively via a user study and observe that 83.3% of users find that the synthesized instructions accurately capture the details of the environment and show characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions and observe close agreement with the baseline on standard success metrics (< 1% change in SR), which quantifies the viability of generated instructions as a replacement for human-annotated data. To the best of our knowledge, ours is the first LLM-driven approach capable of generating "human-like" instructions in a platform-agnostic manner without requiring any form of training.
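As a rough illustration of the in-context conditioning described above, the following minimal sketch assembles a few-shot prompt from a handful of reference instructions and a set of question-answer pairs about the environment. It is not the authors' implementation: the function name `build_instruction_prompt`, the prompt wording, and the example questions are all hypothetical, and the calls to the VQA model and to the LLM are deliberately left out.

```python
from typing import Dict, List


def build_instruction_prompt(
    reference_instructions: List[str],
    vqa_answers: Dict[str, str],
) -> str:
    """Assemble a few-shot prompt (hypothetical format, for illustration only).

    reference_instructions: a few human-written wayfinding instructions
        used as in-context examples.
    vqa_answers: question -> answer pairs describing the environment,
        assumed to come from a separate VQA step that is not shown here.
    """
    if not reference_instructions:
        raise ValueError("in-context learning needs at least one reference")

    lines = ["You write wayfinding instructions for an embodied robot agent.", ""]

    # In-context examples: the only "supervision" the LLM receives.
    lines.append("Reference instructions:")
    for i, ref in enumerate(reference_instructions, start=1):
        lines.append(f"{i}. {ref}")
    lines.append("")

    # Environment details gathered by visual question answering.
    lines.append("Environment details (from visual question answering):")
    for question, answer in vqa_answers.items():
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append("")

    lines.append("Write one new instruction in the style of the references.")
    return "\n".join(lines)
```

The resulting string would then be sent to whichever LLM is in use. Because the sketch takes plain strings as input, nothing in it depends on a particular simulator, which is the sense in which such a pipeline can be platform-agnostic.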