In recent years, 3D scene graphs have emerged as a powerful world representation, offering both geometric accuracy and semantic richness. Combining 3D scene graphs with large language models (LLMs) enables robots to reason, plan, and navigate in complex human-centered environments. However, current approaches to constructing 3D scene graphs are semantically limited to a predefined set of relationships, and their serialization in large environments can easily exceed an LLM's context window. We introduce KeySG, a framework that represents 3D scenes as a hierarchical graph of floors, rooms, objects, and functional elements, whose nodes are augmented with multi-modal information extracted from keyframes selected to optimize geometric and visual coverage. The keyframes allow us to efficiently leverage vision-language models (VLMs) to extract scene information, which removes the need to explicitly model relationship edges between objects and enables more general, task-agnostic reasoning and planning. Our approach can process complex and ambiguous queries while mitigating the scalability issues of large scene graphs by using a hierarchical multi-modal retrieval-augmented generation (RAG) pipeline to extract only the relevant context from the graph. Evaluated on three distinct benchmarks (3D object semantic segmentation, functional element segmentation, and complex query retrieval), KeySG outperforms prior approaches on most metrics, demonstrating its superior semantic richness and efficiency.
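To make the idea of a hierarchical graph with coarse-to-fine retrieval more concrete, the following is a minimal Python sketch, not KeySG's actual implementation. Everything in it is an assumption made for illustration: the `Node` class, the `retrieve` function, the toy bag-of-words `embed` (standing in for a learned multi-modal encoder), and the hand-written descriptions (standing in for text that a VLM might produce from keyframes). It only shows how descending floor → room → object → functional element keeps the context passed to an LLM small.

```python
from __future__ import annotations

import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical graph node: a floor, room, object, or functional element."""
    name: str
    level: str
    description: str  # assumed stand-in for VLM-derived keyframe text
    children: list[Node] = field(default_factory=list)


def embed(text: str) -> Counter:
    # Toy bag-of-words vector; a real pipeline would use a learned encoder.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(root: Node, query: str, top_k: int = 1) -> list[Node]:
    """Descend one level at a time, keeping the top_k best-matching children."""
    q = embed(query)
    frontier, selected = [root], []
    while frontier:
        candidates = [c for n in frontier for c in n.children]
        if not candidates:
            break
        candidates.sort(key=lambda c: cosine(q, embed(c.description)), reverse=True)
        frontier = candidates[:top_k]
        selected.extend(frontier)
    return selected


# Illustrative scene; all names and descriptions are invented for this sketch.
scene = Node("floor_0", "floor", "ground floor", [
    Node("kitchen", "room", "kitchen with a stove and a sink", [
        Node("cabinet", "object", "wooden cabinet for storing plates", [
            Node("cabinet_handle", "functional_element", "metal handle that opens a door"),
        ]),
        Node("stove", "object", "gas stove with four burners", [
            Node("stove_knob", "functional_element", "knob to turn on the stove burner"),
        ]),
    ]),
    Node("living_room", "room", "living room with a sofa and a television", [
        Node("sofa", "object", "grey fabric sofa"),
    ]),
])

names = [n.name for n in retrieve(scene, "turn on the stove in the kitchen")]
print(names)  # ['kitchen', 'stove', 'stove_knob']
```

Only the three selected nodes would be serialized into the prompt, rather than the whole graph. Note that no object-to-object relationship edges are stored in this sketch; any relational reasoning would be left to the model consuming the retrieved descriptions.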