Large Language Models (LLMs) have emerged as a tool for robots to generate task plans using common-sense reasoning. For an LLM to generate actionable plans, scene context must be provided, often through a map. Recent works have shifted from explicit maps with fixed semantic classes to implicit open-vocabulary maps based on queryable embeddings that can represent any semantic class. However, because embeddings are implicit, they cannot directly report the scene context and require further processing for LLM integration. To address this, we propose an explicit text-based map, built upon large-scale image recognition models, that can represent thousands of semantic classes and, being text, integrates easily with LLMs. We study how entities in our map can be localized and show through evaluations that localizations from our text-based map perform comparably to those from open-vocabulary maps while using two to four orders of magnitude less memory. Real-robot experiments demonstrate the grounding of an LLM with the text-based map to solve user tasks.
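To illustrate the core idea (this is a minimal sketch, not the authors' implementation; the entity labels, coordinates, and prompt format below are all hypothetical), an explicit text-based map can be as simple as a list of recognized entity labels with positions, which serializes directly into an LLM prompt and supports localization by plain text matching:

```python
# Hypothetical sketch of an explicit text-based map: each entry pairs a
# text label (e.g. from a large-scale image recognition model) with a 2D
# position. Being plain text, the map can be dropped straight into an
# LLM's context window without decoding any embeddings.

text_map = [
    {"label": "coffee mug", "position": (1.2, 0.4)},
    {"label": "dining table", "position": (1.0, 0.5)},
    {"label": "sofa", "position": (3.8, 2.1)},
]

def map_to_prompt(entities):
    """Render the map as plain text for inclusion in an LLM prompt."""
    lines = [
        f"- {e['label']} at (x={e['position'][0]}, y={e['position'][1]})"
        for e in entities
    ]
    return "Known objects in the scene:\n" + "\n".join(lines)

def localize(entities, query):
    """Return positions of entities whose label matches the text query."""
    return [e["position"] for e in entities if query in e["label"]]

print(map_to_prompt(text_map))
print(localize(text_map, "mug"))
```

Storing a short text label per entity costs tens of bytes, whereas implicit open-vocabulary maps typically store a high-dimensional float embedding per map element, which is consistent with the abstract's claim of two to four orders of magnitude lower memory use.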