Recent research on texture synthesis for 3D shapes has benefited greatly from the rapid progress of 2D text-to-image diffusion models, spanning both inpainting-based and optimization-based approaches. However, these methods overlook the modality gap between 2D diffusion models and 3D objects: they typically render a 3D object into 2D images and texture each image separately. In this paper, we revisit texture synthesis and propose a Variance-alignment-based 3D-2D Collaborative Denoising framework, dubbed VCD-Texture, to address these issues. Specifically, we first unify 2D and 3D latent feature learning in the diffusion self-attention modules via re-projected 3D attention receptive fields. The denoised multi-view 2D latent features are then aggregated in 3D space and rasterized back to form more consistent 2D predictions. The rasterization process, however, suffers from an intractable variance bias, which we address theoretically with the proposed variance alignment, achieving high-fidelity texture synthesis. Moreover, we present an inpainting refinement that further improves details in conflicting regions. Notably, no publicly available benchmark exists for evaluating texture synthesis, which hinders progress in this area. We therefore construct a new evaluation set built upon three open-source 3D datasets and adopt four metrics to thoroughly validate texturing performance. Comprehensive experiments demonstrate that VCD-Texture achieves superior performance against other counterparts.
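To make the variance-bias point concrete: averaging n independent per-view latent features shrinks their variance by roughly a factor of 1/n, which distorts the statistics the diffusion denoiser expects. The paper's exact formulation is not given here; the following is a minimal NumPy sketch of one plausible form of variance alignment, where the aggregated features are rescaled so that their per-channel variance matches the mean per-view variance. The function name `variance_align` and the shape conventions are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def variance_align(view_latents):
    """Aggregate multi-view latents and restore their variance.

    view_latents: array of shape (n_views, C, H, W) holding the
    per-view denoised latent features for aligned spatial locations.
    Returns an aggregated latent of shape (C, H, W) whose per-channel
    spatial variance matches the average per-view variance.
    """
    # Naive aggregation: the mean of n independent views has its
    # variance reduced by roughly 1/n relative to a single view.
    agg = view_latents.mean(axis=0)

    # Target statistic: average per-view standard deviation per channel.
    target_std = view_latents.std(axis=(2, 3)).mean(axis=0)  # shape (C,)

    # Rescale deviations around the channel mean to hit the target std.
    mu = agg.mean(axis=(1, 2), keepdims=True)
    std = agg.std(axis=(1, 2), keepdims=True) + 1e-8
    return mu + (agg - mu) / std * target_std[:, None, None]
```

Under this sketch, averaging four unit-variance views would yield features with standard deviation near 0.5, and the rescaling step restores it to the single-view level before the latents are fed back into the denoising loop.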