We present latentSplat, a method to predict semantic Gaussians in a 3D latent space that can be splatted and decoded by a lightweight generative 2D architecture. Existing methods for generalizable 3D reconstruction either do not scale to large scenes and resolutions, or are limited to interpolation of close input views. latentSplat combines the strengths of regression-based and generative approaches while being trained purely on readily available real video data. The core of our method is variational 3D Gaussians, a representation that efficiently encodes varying uncertainty within a latent space consisting of 3D feature Gaussians. From these Gaussians, specific instances can be sampled and rendered via efficient splatting and a fast, generative decoder. We show that latentSplat outperforms previous works in reconstruction quality and generalization, while being fast and scalable to high-resolution data.
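To illustrate the "variational 3D Gaussians" idea described above, the following is a minimal sketch of the sampling step: each Gaussian carries a latent feature distribution (here parameterized by a mean and log-variance, an assumption on our part), from which a specific feature instance is drawn via the reparameterization trick before splatting and decoding. All array shapes, names, and the NumPy-based setup are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N Gaussians, each with a D-dimensional semantic feature.
N, D = 1024, 32

# In latentSplat, an encoder network would predict these per-Gaussian feature
# distributions (alongside position, covariance, and opacity, omitted here).
# We use random placeholders instead of network outputs.
feat_mean = rng.standard_normal((N, D)).astype(np.float32)
feat_logvar = (0.1 * rng.standard_normal((N, D))).astype(np.float32)

def sample_features(mean, logvar, rng):
    """Reparameterization trick: draw one feature instance per Gaussian.

    Low predicted variance -> the sample stays close to the regressed mean;
    high predicted variance -> the sample varies, modeling uncertainty.
    """
    eps = rng.standard_normal(mean.shape).astype(np.float32)
    return mean + np.exp(0.5 * logvar) * eps

features = sample_features(feat_mean, feat_logvar, rng)
# 'features' would then be splatted into a 2D feature image and passed
# through a lightweight generative decoder to produce the final RGB view.
```

The key design point this sketch captures is that uncertainty lives per Gaussian in feature space, so regions well constrained by the input views collapse toward deterministic regression, while ambiguous regions remain genuinely generative.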