We tackle the task of learning dynamic 3D semantic radiance fields given a single monocular video as input. Our learned semantic radiance field captures per-point semantics as well as color and geometric properties for a dynamic 3D scene, enabling the generation of novel views and their corresponding semantics. This allows the segmentation and tracking of a diverse set of 3D semantic entities, specified using a simple and intuitive interface that includes a user click or a text prompt. To this end, we present DGD, a unified 3D representation for both the appearance and the semantics of a dynamic 3D scene, building upon the recently proposed dynamic 3D Gaussians representation. Our representation is optimized over time with both color and semantic information. Key to our method is the joint optimization of the appearance and semantic attributes, which together affect the geometric properties of the scene. We evaluate our approach on dense semantic 3D object tracking and demonstrate high-quality results that are fast to render, for a diverse set of scenes. Our project webpage is available at https://isaaclabe.github.io/DGD-Website/
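To make the joint appearance and semantic optimization concrete, the following is a minimal sketch, not the authors' implementation: each Gaussian carries a learnable semantic feature vector alongside its color, and both are optimized against the input video under a shared loss. The renderers `render_rgb` and `render_features` are hypothetical stand-ins for a differentiable Gaussian splatting pipeline, and the 2D supervision `teacher_feat` assumes semantic feature maps produced by a pretrained vision backbone, neither of which is specified in the abstract.

```python
# Minimal sketch (assumptions labeled): per-Gaussian semantic features
# optimized jointly with color. In a real pipeline, both losses would
# backpropagate through the same differentiable splatting geometry, so
# semantics and appearance shape the scene together.
import torch

N, F = 10_000, 64           # number of Gaussians, semantic feature dimension
H, W = 120, 160             # render resolution

# Learnable per-Gaussian attributes: RGB color and a semantic feature vector.
colors = torch.rand(N, 3, requires_grad=True)
sem_feats = torch.randn(N, F, requires_grad=True)

def render_rgb(colors):
    # Hypothetical stand-in for differentiable Gaussian splatting of color.
    return colors.mean(0).expand(H, W, 3)

def render_features(feats):
    # Hypothetical stand-in for splatting semantic features with the same
    # per-Gaussian geometry and blending weights as the color pass.
    return feats.mean(0).expand(H, W, F)

gt_rgb = torch.rand(H, W, 3)         # ground-truth video frame (placeholder)
teacher_feat = torch.randn(H, W, F)  # assumed 2D semantic feature map from a
                                     # pretrained backbone (placeholder)

opt = torch.optim.Adam([colors, sem_feats], lr=1e-3)
for step in range(100):
    opt.zero_grad()
    loss_rgb = (render_rgb(colors) - gt_rgb).abs().mean()
    loss_sem = (render_features(sem_feats) - teacher_feat).square().mean()
    # Joint loss: appearance and semantics are optimized together, which is
    # the key property the abstract describes.
    (loss_rgb + 0.1 * loss_sem).backward()
    opt.step()
```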