Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D assets with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images from constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA to filter high-quality data and rewrite inaccurate captions. Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while preserving the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate multi-view images with superior aesthetic quality and image-text alignment while maintaining view consistency.
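The TTR idea described above can be sketched as a timestep sampling rule in a DDPM-style training loop. This is a minimal illustration only, not the paper's exact schedule: the cutoff value `t_cut` and all names here are assumptions. The intuition is that high-noise (early denoising) timesteps determine coarse structure and cross-view consistency, so synthetic multi-view batches are restricted to them, while the low-noise timesteps that refine texture are left to the original 2D prior.

```python
import random


def sample_training_timestep(is_synthetic_mv: bool,
                             t_max: int = 1000,
                             t_cut: int = 400) -> int:
    """Sketch of a Training Timestep Reschedule (TTR)-style rule.

    Hypothetical policy: synthetic multi-view batches train only on
    high-noise timesteps (t >= t_cut), where multi-view consistency is
    learned; other batches use the full range [0, t_max), preserving
    the 2D diffusion prior for fine-detail denoising steps.
    """
    if is_synthetic_mv:
        # Restrict synthetic multi-view data to the high-noise regime.
        return random.randint(t_cut, t_max - 1)
    # Real 2D data may supervise any denoising step.
    return random.randint(0, t_max - 1)
```

In a real trainer, this sampler would replace the uniform timestep draw for batches flagged as synthetic multi-view data, leaving the rest of the loss computation unchanged.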