A Pixel Is Worth More Than One 3D Gaussians in Single-View 3D Reconstruction

Learning a 3D scene representation from a single-view image is a long-standing, fundamental problem in computer vision, made difficult by the inherent ambiguity of predicting content unseen from the input view. Built on the recently proposed 3D Gaussian Splatting (3DGS), the Splatter Image method has made promising progress on fast single-image novel view synthesis by learning a single 3D Gaussian for each pixel from the U-Net feature map of an input image. However, it has limited expressive power for representing occluded components that are not observable in the input view. To address this problem, this paper presents a Hierarchical Splatter Image method in which a pixel is worth more than one 3D Gaussian. Specifically, each pixel is represented by a parent 3D Gaussian and a small number of child 3D Gaussians. Parent 3D Gaussians are learned as in the vanilla Splatter Image. Child 3D Gaussians are learned via a lightweight Multi-Layer Perceptron (MLP) that takes as input the projected image features of a parent 3D Gaussian and the embedding of a target camera view. Both parent and child 3D Gaussians are learned end-to-end in a stage-wise manner. Jointly conditioning on the input image features seen from the parent Gaussians and on the target camera position helps the model learn to allocate child Gaussians to "see the unseen", recovering occluded details that parent Gaussians often miss. In experiments, the proposed method is evaluated on the ShapeNet-SRN and CO3D datasets and achieves state-of-the-art performance, showing in particular a promising ability to reconstruct content occluded in the input view.
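
The parent/child structure described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`init_mlp`, `predict_children`), the two-layer MLP, the layer sizes, the 14-value child parameterization (position offset, log-scale, rotation quaternion, opacity, colour), and the choice to place children as offsets from their parent's centre are all assumptions made for illustration. Only the inputs of the MLP (per-parent projected image features and a target-camera embedding) come from the abstract.

```python
import numpy as np

# Assumed child layout: 3 offset + 3 log-scale + 4 quaternion + 1 opacity + 3 colour.
CHILD_DIM = 14


def init_mlp(rng, in_dim, hidden_dim, out_dim):
    """Random weights for a two-layer MLP (illustrative initialisation only)."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def predict_children(params, parent_feats, parent_means, cam_embed, num_children):
    """Predict child Gaussians for every pixel (hypothetical head).

    parent_feats: (N, F) image features sampled at each parent Gaussian
    parent_means: (N, 3) parent Gaussian centres
    cam_embed:    (E,)   embedding of the target camera view
    Returns a dict of arrays whose leading shape is (N, num_children).
    """
    n = parent_feats.shape[0]
    # Condition every pixel on the same target-view embedding.
    cam = np.broadcast_to(cam_embed, (n, cam_embed.shape[0]))
    x = np.concatenate([parent_feats, cam], axis=1)

    # Two-layer MLP with ReLU.
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)
    out = (h @ params["w2"] + params["b2"]).reshape(n, num_children, CHILD_DIM)

    offset, log_scale, quat, opacity_logit, rgb_logit = np.split(
        out, [3, 6, 10, 11], axis=-1
    )
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return {
        # Assumption: children are positioned relative to their parent.
        "means": parent_means[:, None, :] + offset,
        "scales": np.exp(log_scale),  # strictly positive
        "rotations": quat / (np.linalg.norm(quat, axis=-1, keepdims=True) + 1e-8),
        "opacities": sigmoid(opacity_logit),
        "colors": sigmoid(rgb_logit),
    }


# Example: 16 pixels, 8-dim features, 4-dim camera embedding, 3 children per pixel.
rng = np.random.default_rng(0)
mlp = init_mlp(rng, in_dim=8 + 4, hidden_dim=32, out_dim=3 * CHILD_DIM)
children = predict_children(
    mlp,
    parent_feats=rng.normal(size=(16, 8)),
    parent_means=rng.normal(size=(16, 3)),
    cam_embed=rng.normal(size=4),
    num_children=3,
)
print(children["means"].shape)
```

In this sketch the full set passed to the rasterizer would be the parents together with all predicted children, so a pixel contributes `1 + num_children` Gaussians rather than one.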
