We present a novel approach to automatically synthesize "wayfinding instructions" for an embodied robot agent. In contrast to prior approaches, which rely heavily on human-annotated datasets designed exclusively for specific simulation platforms, our algorithm uses in-context learning to condition an LLM to generate instructions from just a few references. Using an LLM-based Visual Question Answering strategy, we gather detailed information about the environment, which the LLM then uses for instruction synthesis. We implement our approach on multiple simulation platforms, including Matterport3D, AI Habitat, and ThreeDWorld, demonstrating its platform-agnostic nature. We evaluate our approach subjectively via a user study and observe that 83.3% of users find that the synthesized instructions accurately capture the details of the environment and exhibit characteristics similar to those of human-generated instructions. Further, we conduct zero-shot navigation with multiple approaches on the REVERIE dataset using the generated instructions and observe performance very close to the baseline on standard success metrics (< 1% change in SR), quantifying the viability of the generated instructions as a replacement for human-annotated data. Finally, we discuss the applicability of our approach to enabling a generalizable evaluation of embodied navigation policies. To the best of our knowledge, ours is the first LLM-driven approach capable of generating "human-like" instructions in a platform-agnostic manner, without training.
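To make the described pipeline concrete, below is a minimal sketch of the few-shot conditioning strategy, assuming a generic VQA model exposing an `ask(image, question)` method and an LLM client exposing a `complete(prompt)` method. The question set, exemplar instructions, and prompt layout are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch: condition an LLM on a few reference instructions plus
# VQA-derived scene facts to synthesize a wayfinding instruction.
# `vqa_model`, `llm`, and all prompt contents below are hypothetical.

VQA_QUESTIONS = [
    "What room is visible in this image?",
    "What large objects are visible?",
    "What landmarks could a person navigate by?",
]

# A few human-written reference instructions used for in-context conditioning.
EXEMPLARS = [
    "Walk past the couch in the living room and stop at the kitchen doorway.",
    "Go up the stairs, turn left at the painting, and enter the bedroom.",
]

def describe_viewpoint(vqa_model, image):
    """Query a VQA model about one viewpoint along the path."""
    return {q: vqa_model.ask(image=image, question=q) for q in VQA_QUESTIONS}

def synthesize_instruction(llm, vqa_model, path_images):
    """Build a few-shot prompt from exemplars and VQA answers, then ask the LLM."""
    scene_facts = [describe_viewpoint(vqa_model, img) for img in path_images]
    prompt_lines = ["Example wayfinding instructions:"]
    prompt_lines += [f"- {ex}" for ex in EXEMPLARS]
    prompt_lines.append("Observations along the new path:")
    for i, facts in enumerate(scene_facts):
        for question, answer in facts.items():
            prompt_lines.append(f"Viewpoint {i + 1}: {question} -> {answer}")
    prompt_lines.append(
        "Write one concise, human-like wayfinding instruction for this path:"
    )
    return llm.complete("\n".join(prompt_lines))
```

In a sketch like this, only the rendered viewpoints change across simulators, which is one plausible way to realize the platform-agnostic behavior the abstract claims: the exemplars and prompt remain fixed while the VQA observations adapt to each environment.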