DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model

Recent 3D generative models have achieved remarkable performance in synthesizing high-resolution, photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging because it requires massive sets of training images together with their camera distribution information. Text-guided domain adaptation methods have shown impressive performance in converting a 2D generative model trained on one domain into models for other domains with different styles by leveraging CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of these methods is that the sample diversity of the original generative model is not well preserved in the domain-adapted generative models, owing to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation is even more challenging for 3D generative models, not only because of catastrophic diversity loss but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models that uses text-to-image diffusion models, which can synthesize diverse images per text prompt, without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline fine-tunes the state-of-the-art 3D generator of the source domain to synthesize high-resolution, multi-view consistent images in text-guided target domains without additional data, outperforming existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations, such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction, to take full advantage of the diversity in text.
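The diversity argument in the abstract can be illustrated with a toy numerical sketch. It is not the DATID-3D pipeline and uses neither CLIP nor a diffusion model: random vectors stand in for generated samples, and the names `adapt`, `mean_pairwise_distance`, `prompt_embedding`, and `strength` are hypothetical. It only shows that pulling every sample toward one fixed target (as a deterministic text embedding would provide) shrinks the spread of the samples, while pulling each sample toward its own target (as diverse per-prompt images might provide) largely keeps it.

```python
import numpy as np

def adapt(samples, targets, strength=0.9):
    """Move each sample a fraction `strength` of the way toward its target."""
    return (1.0 - strength) * samples + strength * targets

def mean_pairwise_distance(x):
    """Average Euclidean distance over all distinct pairs of rows in x."""
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    n = len(x)
    return dists.sum() / (n * (n - 1))

rng = np.random.default_rng(0)
source = rng.normal(size=(256, 16))      # toy stand-in for source-domain samples
prompt_embedding = rng.normal(size=16)   # assumption: one fixed vector per prompt

# Deterministic guidance: every sample shares the same single target.
single_target = np.broadcast_to(prompt_embedding, source.shape)
collapsed = adapt(source, single_target)

# Stochastic guidance: each sample gets its own target scattered around the prompt.
per_sample_targets = prompt_embedding + rng.normal(size=source.shape)
diverse = adapt(source, per_sample_targets)

print("source   :", mean_pairwise_distance(source))
print("collapsed:", mean_pairwise_distance(collapsed))
print("diverse  :", mean_pairwise_distance(diverse))
```

With a shared target, every pairwise distance is scaled by exactly `1 - strength`, so the spread collapses as adaptation gets stronger. With per-sample targets, the spread of the targets themselves carries over into the adapted samples.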
