We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations of recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained to low-resolution inputs by their computational cost. Moreover, existing generative enhancement methods tend to be 3D-agnostic, producing inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which handles high-resolution images without being limited by the ViT backbone and endows the Gaussians with additional features that store high-frequency details. We develop a feature-guided diffusion network that preserves high-frequency details during the restoration process. We further introduce a unified training strategy that jointly optimizes the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method achieves superior generation quality across multiple datasets.