INHerit-SG: Incremental Hierarchical Semantic Scene Graphs with RAG-Style Retrieval

Driven by advances in foundation models, semantic scene graphs have emerged as a prominent paradigm for high-level 3D environmental abstraction in robot navigation. However, existing approaches are fundamentally misaligned with the needs of embodied tasks: because they rely on either offline batch processing or implicit feature embeddings, the resulting maps can hardly support interpretable reasoning about human intent in complex environments. To address these limitations, we present INHerit-SG, which redefines the map as a structured, RAG-ready knowledge base in which natural-language descriptions serve as explicit semantic anchors that better align with human intent. An asynchronous dual-process architecture, together with a Floor-Room-Area-Object hierarchy, decouples geometric segmentation from time-consuming semantic reasoning. An event-triggered map update mechanism reorganizes the graph only when meaningful semantic events occur, which allows the graph to maintain long-term consistency with relatively low computational overhead. For retrieval, we deploy multi-role Large Language Models (LLMs) to decompose queries into atomic constraints and handle logical negations, and we employ a hard-to-soft filtering strategy to ensure robust reasoning. This explicit interpretability improves the success rate and reliability of complex retrievals, enabling the system to adapt to a broader spectrum of human interaction tasks. We evaluate INHerit-SG on a newly constructed dataset, HM3DSem-SQR, and in real-world environments. Experiments demonstrate that our system achieves state-of-the-art performance on complex queries and indicate its scalability to downstream navigation tasks. Project Page: https://fangyuktung.github.io/INHeritSG.github.io/
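To make the Floor-Room-Area-Object hierarchy and the hard-to-soft filtering idea concrete, the following is a minimal illustrative sketch and not the authors' implementation. All names (`Node`, `iter_objects`, `retrieve`, `build_demo_graph`) and the demo scene are hypothetical. The abstract's LLM-based query decomposition is assumed to have already produced the atomic constraints, which are passed in here as plain arguments, and simple keyword overlap stands in for whatever soft semantic scoring the real system uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of a hypothetical Floor-Room-Area-Object graph."""
    name: str
    level: str  # "floor", "room", "area", or "object"
    description: str = ""  # natural-language semantic anchor
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child


def iter_objects(node, path=()):
    """Yield (ancestor_path, object_node) for every object under `node`."""
    path = path + (node,)
    if node.level == "object":
        yield path, node
    for child in node.children:
        yield from iter_objects(child, path)


def retrieve(root, room=None, exclude_room=None, keywords=()):
    """Hard constraints prune candidates; soft keyword overlap ranks the rest."""
    scored = []
    for path, obj in iter_objects(root):
        rooms = [n.name for n in path if n.level == "room"]
        # Hard filter: required room must be on the path.
        if room is not None and room not in rooms:
            continue
        # Hard filter for a negation such as "not in the kitchen".
        if exclude_room is not None and exclude_room in rooms:
            continue
        # Soft score: count of query keywords found in the description.
        words = obj.description.lower().split()
        score = sum(1 for k in keywords if k.lower() in words)
        scored.append((score, obj.name))
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [name for _, name in scored]


def build_demo_graph():
    """Small made-up scene used only for illustration."""
    floor = Node("floor_0", "floor")
    kitchen = floor.add(Node("kitchen", "room"))
    bedroom = floor.add(Node("bedroom", "room"))
    counter = kitchen.add(Node("counter", "area"))
    bedside = bedroom.add(Node("bedside", "area"))
    counter.add(Node("mug_1", "object", "a red ceramic mug"))
    counter.add(Node("kettle_1", "object", "a steel kettle"))
    bedside.add(Node("mug_2", "object", "a blue mug next to a lamp"))
    return floor
```

In this sketch, a query such as "a mug that is not in the kitchen" would correspond to `retrieve(build_demo_graph(), exclude_room="kitchen", keywords=("mug",))`. The negation is applied as a hard filter before any soft ranking, so candidates that violate it never reach the scoring step.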
