PODIA-3D: Domain Adaptation of 3D Generative Model Across Large Domain Gap Using Pose-Preserved Text-to-Image Diffusion

Significant advances have recently been made in 3D generative models; however, training these models across diverse domains remains challenging and requires a huge amount of training data as well as knowledge of the pose distribution. Text-guided domain adaptation methods allow a generator to be adapted to target domains using text prompts, thereby obviating the need to assemble large amounts of data. Recently, DATID-3D has shown impressive sample quality in text-guided domain adaptation, preserving the diversity in text by leveraging text-to-image diffusion. However, adapting 3D generators to domains with significant gaps from the source domain remains challenging due to the following issues in current text-to-image diffusion models: 1) a shape-pose trade-off in diffusion-based translation, 2) pose bias, and 3) instance bias in the target domain. These issues result in inferior 3D shapes, low text-image correspondence, and low intra-domain diversity in the generated samples. To address them, we propose a novel pipeline called PODIA-3D, which uses pose-preserved text-to-image diffusion-based domain adaptation for 3D generative models. We construct a pose-preserved text-to-image diffusion model that allows the use of extremely high-level noise for significant domain changes. We also propose specialized-to-general sampling strategies to improve the details of the generated samples. Moreover, to overcome the instance bias, we introduce a text-guided debiasing method that improves intra-domain diversity. Consequently, our method successfully adapts 3D generators across significant domain gaps. Our qualitative results and user study demonstrate that our approach outperforms existing 3D text-guided domain adaptation methods in terms of text-image correspondence, realism, diversity of rendered images, and sense of depth of 3D shapes in the generated samples.
