We present GaussExplorer, a framework for embodied exploration and reasoning built on 3D Gaussian Splatting (3DGS). Prior approaches to language-embedded 3DGS have made meaningful progress in aligning text queries with Gaussian embeddings, but they are optimized for relatively simple queries and struggle to interpret complex, compositional language. Alternative approaches based on object-centric RGB-D structured memories provide spatial grounding but are constrained to pre-fixed viewpoints. To address these limitations, GaussExplorer introduces Vision-Language Models (VLMs) on top of 3DGS to enable question-driven exploration and reasoning within 3D scenes. We first identify the pre-captured images most correlated with the query question, then adjust them into novel viewpoints that capture visual information more accurately for VLM reasoning. Experiments show that our method outperforms existing approaches on several benchmarks, demonstrating the effectiveness of integrating VLM-based reasoning with 3DGS for embodied tasks.
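The first stage of the pipeline, selecting the pre-captured images most correlated with the query question, can be illustrated as a similarity-based retrieval over joint text and image embeddings. The sketch below is hypothetical and not the paper's actual implementation: the function name `retrieve_top_k` and the random stand-in embeddings are assumptions, standing in for whatever vision-language encoder the system uses.

```python
import numpy as np

def retrieve_top_k(query_emb, image_embs, k=3):
    # Hypothetical sketch: rank pre-captured views by cosine
    # similarity between the query embedding and each view's
    # image embedding, returning the indices of the top-k views.
    q = query_emb / np.linalg.norm(query_emb)
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sims = imgs @ q                      # cosine similarity per view
    top_idx = np.argsort(sims)[::-1][:k] # best views first
    return top_idx, sims

# Stand-in embeddings (a real system would use a shared
# text/image encoder to produce these vectors).
rng = np.random.default_rng(0)
query = rng.normal(size=512)             # "query question" embedding
views = rng.normal(size=(100, 512))      # per-view image embeddings
top_idx, sims = retrieve_top_k(query, views, k=3)
print(top_idx)
```

The retrieved views would then serve as starting points for viewpoint adjustment: rendering novel views from the 3DGS representation around each retrieved camera pose before passing the images to the VLM.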