As agentic artificial intelligence for robotics evolves, the need grows for agents that can build and retrieve memories and observations efficiently. Robots operating in complex environments must build memory structures that represent their current operating context in order to support useful human-robot interaction. People interacting with a robot may expect the embodied agent to provide information about locations, events, or objects, so the agent must give precise answers within human-like inference times to be perceived as responsive. We propose the Embodied Light Graph Retrieval Agent (EmbodiedLGR-Agent), a visual-language model (VLM)-driven agent architecture that constructs dense and efficient representations of robot operating environments. EmbodiedLGR-Agent addresses the need for an efficient memory representation of the environment with a hybrid building-retrieval approach built on parameter-efficient VLMs: low-level information about objects and their positions is stored in a semantic graph, while high-level descriptions of the observed scenes are retained in a traditional retrieval-augmented architecture. Evaluated on the popular NaVQA dataset, EmbodiedLGR-Agent achieves state-of-the-art inference and querying times for embodied agents while retaining accuracy on the global task that is competitive with current state-of-the-art approaches. Moreover, EmbodiedLGR-Agent was successfully deployed on a physical robot, showing practical utility in real-world human-robot interaction while running the visual-language model and the building-retrieval pipeline locally.
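To make the hybrid memory idea concrete, the following is a minimal Python sketch of a two-part store: a small graph of observed objects with their positions, alongside a list of scene descriptions that can be searched by text. It is an illustration under stated assumptions, not the EmbodiedLGR-Agent implementation. The names `HybridMemory`, `ObjectNode`, `add_observation`, `locate`, and `retrieve_scenes` are hypothetical, the captions and detections are hand-written stand-ins for VLM output, and bag-of-words cosine similarity stands in for whatever retrieval model a retrieval-augmented pipeline would actually use.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ObjectNode:
    """Low-level fact: an object label and where it was seen."""
    label: str
    position: tuple[float, float, float]  # (x, y, z) in an assumed map frame
    timestamp: float


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


@dataclass
class HybridMemory:
    """Hypothetical two-part memory: object graph plus scene descriptions."""
    objects: dict[int, ObjectNode] = field(default_factory=dict)
    edges: dict[int, set[int]] = field(default_factory=dict)  # co-observation links
    scenes: list[tuple[float, str]] = field(default_factory=list)

    def add_observation(self, timestamp, caption, detections):
        """Store one frame: a caption and the (label, position) pairs seen in it."""
        self.scenes.append((timestamp, caption))
        ids = []
        for label, position in detections:
            node_id = len(self.objects)
            self.objects[node_id] = ObjectNode(label, position, timestamp)
            self.edges[node_id] = set()
            ids.append(node_id)
        # Link objects that were observed together in the same frame.
        for a in ids:
            self.edges[a].update(b for b in ids if b != a)

    def locate(self, label):
        """Graph lookup: the most recent sighting of a label, or None."""
        hits = [n for n in self.objects.values() if n.label == label]
        return max(hits, key=lambda n: n.timestamp) if hits else None

    def retrieve_scenes(self, query, k=1):
        """Toy retrieval: rank captions by similarity to the query text."""
        q = Counter(query.lower().split())
        scored = [
            (_cosine(q, Counter(caption.lower().split())), ts, caption)
            for ts, caption in self.scenes
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(ts, caption) for _, ts, caption in scored[:k]]


memory = HybridMemory()
memory.add_observation(
    0.0, "a kitchen with a red mug on the counter",
    [("mug", (1.0, 2.0, 0.9)), ("counter", (1.0, 2.2, 0.8))],
)
memory.add_observation(
    5.0, "a hallway with a fire extinguisher on the wall",
    [("fire extinguisher", (4.0, 0.5, 1.2))],
)
memory.add_observation(
    9.0, "an office desk with a red mug and a laptop",
    [("mug", (7.5, 3.0, 0.7)), ("laptop", (7.6, 3.1, 0.7))],
)

print(memory.locate("mug").position)
print(memory.retrieve_scenes("where is the fire extinguisher"))
```

In this sketch, a question about where an object is can be answered from the graph alone, while a question about what a place looked like falls back to searching the stored descriptions. That split mirrors the low-level versus high-level division described in the abstract, but the routing logic, graph schema, and retrieval model of the actual system are not specified here.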