Robots operating in homes, warehouses, and other object-rich environments need memory systems that can find specific object instances on demand. Object-level memory alone is often insufficient: scenes contain many plausibly matching objects, and users refer to the target through its relations to landmarks and surrounding objects (e.g., ``the tall lamp below the dartboard and to the left of the poster''). This demands a relational spatial memory that supports retrieval through semantic, appearance, and spatial predicates over objects. To this end, we present FARM (Find Anything using Relational Spatial Memory), which builds, in real time at 5--10 Hz, a compact, open-vocabulary, object-level memory with geometry, visual-language descriptors, and viewpoint evidence. At query time, FARM uses VLMs to parse the query and score visual evidence, while grounding spatial constraints explicitly through object symbols and relational predicates. This structured use of VLMs enables more accurate and robust retrieval than end-to-end reasoning over frame histories or scene-graph context. In experiments on 44k language queries spanning 67 indoor and outdoor scenes, ranging from 15 to 15,000 m$^2$, FARM improves Recall@5 and Recall@10 over prior methods by 164% and 224%, respectively, and a final VLM reranking stage improves Accuracy@1 by 35%, while running in real time. We further demonstrate closed-loop deployment on a quadrupedal robot using onboard sensors and compute.