Generating high-quality 3D assets from text and images has long been challenging, primarily due to the absence of scalable 3D representations capable of capturing intricate geometry distributions. In this work, we introduce Direct3D, a native 3D generative model that scales to in-the-wild input images without requiring a multiview diffusion model or SDS optimization. Our approach comprises two primary components: a Direct 3D Variational Auto-Encoder (D3D-VAE) and a Direct 3D Diffusion Transformer (D3D-DiT). D3D-VAE efficiently encodes high-resolution 3D shapes into a compact and continuous latent triplane space. Notably, our method directly supervises the decoded geometry using a semi-continuous surface sampling strategy, diverging from previous methods that rely on rendered images as supervision signals. D3D-DiT models the distribution of the encoded 3D latents and is specifically designed to fuse positional information from the three feature maps of the triplane latent, enabling a native 3D generative model that scales to large 3D datasets. Additionally, we introduce an innovative image-to-3D generation pipeline incorporating semantic- and pixel-level image conditions, allowing the model to produce 3D shapes consistent with the provided conditional image. Extensive experiments demonstrate that our large-scale pre-trained Direct3D significantly outperforms previous image-to-3D approaches in generation quality and generalization ability, establishing a new state of the art for 3D content creation. Project page: https://nju-3dv.github.io/projects/Direct3D/.
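To make the triplane latent concrete, the sketch below shows how geometry can be decoded from such a latent: a query point is projected onto the three axis-aligned planes, bilinearly interpolated features are fused, and a small MLP predicts an occupancy logit. This is a minimal illustration in PyTorch of the general triplane-decoding pattern; all class, parameter, and variable names here are hypothetical, and the exact D3D-VAE decoder and its semi-continuous surface sampling strategy are detailed in the paper, not reproduced here.

```python
# Minimal sketch of decoding occupancy from a triplane latent (assumed
# PyTorch setup; names are illustrative, not the paper's actual code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneOccupancyDecoder(nn.Module):
    """Predicts occupancy at arbitrary 3D points from three latent feature planes."""

    def __init__(self, channels: int = 32, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * channels, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),  # occupancy logit
        )

    @staticmethod
    def sample_plane(plane: torch.Tensor, coords2d: torch.Tensor) -> torch.Tensor:
        # plane: (B, C, H, W); coords2d: (B, N, 2) in [-1, 1]
        feats = F.grid_sample(
            plane, coords2d.unsqueeze(1),  # grid shape (B, 1, N, 2)
            mode="bilinear", align_corners=False,
        )  # -> (B, C, 1, N)
        return feats.squeeze(2).transpose(1, 2)  # (B, N, C)

    def forward(self, triplane: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
        # triplane: (B, 3, C, H, W); points: (B, N, 3) in [-1, 1]
        f_xy = self.sample_plane(triplane[:, 0], points[..., [0, 1]])
        f_xz = self.sample_plane(triplane[:, 1], points[..., [0, 2]])
        f_yz = self.sample_plane(triplane[:, 2], points[..., [1, 2]])
        return self.mlp(torch.cat([f_xy, f_xz, f_yz], dim=-1)).squeeze(-1)
```

Because the decoder takes continuous query points, geometry can be supervised directly with occupancy targets at sampled 3D locations (e.g., a mix of near-surface and uniform points), rather than indirectly through rendered images.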
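The abstract also notes that D3D-DiT fuses positional information from the three triplane feature maps. A plausible way to realize this, sketched below under assumptions, is to patchify each plane into tokens and tag every token with both a 2D positional embedding and a learned plane embedding before the transformer attends across all three planes jointly. This is a hedged illustration of one such fusion scheme; the paper's actual D3D-DiT design may differ, and every name below is hypothetical.

```python
# Hedged sketch: turning a triplane latent into a token sequence for a
# DiT-style transformer (illustrative names; not the paper's actual code).
import torch
import torch.nn as nn

class TriplaneTokenizer(nn.Module):
    def __init__(self, channels: int = 32, patch: int = 2, dim: int = 512, grid: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(channels, dim, kernel_size=patch, stride=patch)
        n_tok = (grid // patch) ** 2  # tokens per plane
        self.pos_emb = nn.Parameter(torch.zeros(1, n_tok, dim))  # where on the plane
        self.plane_emb = nn.Parameter(torch.zeros(3, 1, dim))    # which plane

    def forward(self, triplane: torch.Tensor) -> torch.Tensor:
        # triplane: (B, 3, C, H, W) -> tokens (B, 3 * n_tok, dim)
        tokens = []
        for i in range(3):
            t = self.proj(triplane[:, i]).flatten(2).transpose(1, 2)  # (B, n_tok, dim)
            tokens.append(t + self.pos_emb + self.plane_emb[i])
        # Downstream transformer blocks attend across all three planes at once,
        # so features at corresponding 3D locations can be fused.
        return torch.cat(tokens, dim=1)
```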