We introduce a new generative approach for synthesizing 3D geometry and images from single-view collections. Most existing approaches predict volumetric density and render multi-view-consistent images with neural radiance fields, and thereby inherit a key limitation: the generated geometry is noisy and unconstrained, which limits the quality and utility of the output meshes. To address this issue, we propose GeoGen, a new SDF-based 3D generative model trained end to end. We first reinterpret the volumetric density as a signed distance function (SDF), which lets us introduce useful priors for generating valid meshes. However, these priors prevent the generative model from learning fine details, limiting the method's applicability to real-world scenarios. To alleviate this problem, we make the transformation learnable and constrain the rendered depth map to be consistent with the zero-level set of the SDF. Through adversarial training, we encourage the network to produce higher-fidelity details on the output meshes. For evaluation, we introduce a synthetic dataset of human avatars captured from 360-degree camera angles, overcoming the challenges of real-world datasets, which often lack 3D consistency and do not cover all camera angles. Our experiments on multiple datasets show that GeoGen produces visually and quantitatively better geometry than previous generative models based on neural radiance fields.
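To make the two key ideas concrete, the following is a minimal PyTorch sketch, not the authors' implementation. It assumes a VolSDF/StyleSDF-style sigmoid conversion from signed distance to density with a learnable sharpness parameter, and a simple penalty tying the SDF's zero-level set to rendered depth; the names `LearnableSDFToDensity`, `init_beta`, and `depth_consistency_loss` are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class LearnableSDFToDensity(nn.Module):
    """Maps signed distances to volumetric densities (assumed form).

    sigma(s) = (1 / beta) * sigmoid(-s / beta), with beta > 0 learned.
    Smaller beta concentrates density near the zero-level set, which is
    what lets an SDF prior yield clean, well-defined surfaces; making
    beta learnable lets the model sharpen the surface during training.
    """
    def __init__(self, init_beta: float = 0.1):
        super().__init__()
        # Store log(beta) so beta stays strictly positive under optimization.
        self.log_beta = nn.Parameter(torch.log(torch.tensor(init_beta)))

    def forward(self, sdf: torch.Tensor) -> torch.Tensor:
        beta = self.log_beta.exp()
        return torch.sigmoid(-sdf / beta) / beta

def depth_consistency_loss(sdf_at_depth: torch.Tensor) -> torch.Tensor:
    """One plausible form of the depth / zero-level-set constraint.

    For each ray, x = o + t_depth * d is the surface point implied by the
    rendered depth map; the SDF evaluated at x should be zero there.
    `sdf_at_depth` holds the SDF values queried at those 3D points.
    """
    return sdf_at_depth.abs().mean()
```

In this sketch the density head and the geometry head share one SDF network: the same signed-distance values drive both volume rendering (via `LearnableSDFToDensity`) and the surface constraint (via `depth_consistency_loss`), so gradients from the adversarial image loss and the depth consistency term both shape the zero-level set from which meshes are extracted.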