Recently, generalizable feed-forward methods based on 3D Gaussian Splatting have gained significant attention for their potential to reconstruct 3D scenes using limited resources. These approaches create a 3D radiance field, parameterized by per-pixel 3D Gaussian primitives, from just a few images in a single forward pass. However, unlike multi-view methods that benefit from cross-view correspondences, 3D scene reconstruction from a single-view image remains an underexplored area. In this work, we introduce CATSplat, a novel generalizable transformer-based framework designed to break through the inherent constraints of monocular settings. First, we propose leveraging textual guidance from a visual-language model to complement the insufficient information available in a single image. By incorporating scene-specific contextual details from text embeddings through cross-attention, we pave the way for context-aware 3D scene reconstruction that goes beyond relying solely on visual cues. Moreover, we advocate utilizing spatial guidance from 3D point features for comprehensive geometric understanding in single-view settings. With these 3D priors, image features can capture rich structural insights for predicting 3D Gaussians without multi-view techniques. Extensive experiments on large-scale datasets demonstrate the state-of-the-art performance of CATSplat in single-view 3D scene reconstruction with high-quality novel view synthesis.
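To make the text-conditioning idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a generic PyTorch multi-head attention in which per-pixel image tokens attend to text embeddings from a vision-language model, and all names (TextCrossAttention, gauss_head, the Gaussian parameter layout) are hypothetical.

```python
# Minimal sketch (assumed, not the CATSplat code): image features attend to
# text embeddings via cross-attention, then a linear head regresses
# per-pixel 3D Gaussian parameters from the fused features.
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Cross-attention block: image tokens (queries) attend to text tokens."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, img_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens:  (B, H*W, d_model) per-pixel features from an image encoder
        # text_tokens: (B, T, d_model)   scene-level text embeddings from a VLM
        attended, _ = self.attn(query=img_tokens, key=text_tokens, value=text_tokens)
        # Residual update keeps the original visual features while injecting
        # scene-specific contextual details from the text.
        return self.norm(img_tokens + attended)

# Usage: fuse 32x32 feature maps with 16 text tokens, then regress a
# hypothetical per-pixel parameter layout (3 mean offset, 3 scale,
# 4 rotation quaternion, 1 opacity, 3 color).
B, H, W, D, T = 2, 32, 32, 256, 16
img = torch.randn(B, H * W, D)
txt = torch.randn(B, T, D)
fused = TextCrossAttention(D)(img, txt)        # (B, H*W, D)
gauss_head = nn.Linear(D, 3 + 3 + 4 + 1 + 3)   # hypothetical head
gaussians = gauss_head(fused)                  # per-pixel Gaussian parameters
```

The residual-plus-norm form is a standard transformer design choice; the point of the sketch is only that text tokens enter as keys and values, so the per-pixel queries are enriched with contextual cues rather than replaced by them.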