While 3D generative models have greatly improved artists' workflows, existing diffusion models for 3D generation suffer from slow generation and poor generalization. To address these issues, we propose a two-stage approach named Hunyuan3D-1.0, available in a lite version and a standard version, both of which support text- and image-conditioned generation. In the first stage, we employ a multi-view diffusion model that efficiently generates multi-view RGB images in approximately 4 seconds. These multi-view images capture rich details of the 3D asset from different viewpoints, relaxing the task from single-view to multi-view reconstruction. In the second stage, we introduce a feed-forward reconstruction model that rapidly and faithfully reconstructs the 3D asset from the generated multi-view images in approximately 7 seconds. The reconstruction network learns to handle the noise and inconsistency introduced by the multi-view diffusion and leverages the available information from the condition image to efficiently recover the 3D structure. Our framework incorporates the text-to-image model Hunyuan-DiT, making it a unified framework that supports both text- and image-conditioned 3D generation. Our standard version has 3x more parameters than our lite version and other existing models. Hunyuan3D-1.0 achieves an impressive balance between speed and quality, significantly reducing generation time while maintaining the quality and diversity of the produced assets.
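To make the two-stage pipeline concrete, here is a minimal Python sketch of the control flow the abstract describes: stage 1 produces multi-view images from a condition image, stage 2 reconstructs a 3D asset from them, and text conditioning is handled by first running a text-to-image model (Hunyuan-DiT in the paper). All class and function names here (MultiViewDiffusion, FeedForwardReconstructor, image_to_3d, text_to_3d) are hypothetical stand-ins, not the actual Hunyuan3D-1.0 API.

```python
# Hypothetical sketch of the two-stage Hunyuan3D-1.0 pipeline described above.
# The real models and their signatures may differ; stubs stand in for the networks.

class MultiViewDiffusion:
    """Stage 1: multi-view diffusion, generating multi-view RGB images (~4 s)."""
    def generate_views(self, condition_image, num_views=6):
        # Placeholder: a real model would run a diffusion sampler here.
        return [f"view_{i}<{condition_image}>" for i in range(num_views)]

class FeedForwardReconstructor:
    """Stage 2: feed-forward reconstruction of the 3D asset (~7 s)."""
    def reconstruct(self, views, condition_image):
        # Placeholder: a real network would fuse the views, tolerating the
        # noise/inconsistency from stage 1, and also exploit the condition image.
        return {"mesh": "asset.obj", "views_used": len(views), "cond": condition_image}

def image_to_3d(condition_image):
    """Image-conditioned generation: stage 1 then stage 2."""
    views = MultiViewDiffusion().generate_views(condition_image)       # ~4 s in the paper
    return FeedForwardReconstructor().reconstruct(views, condition_image)  # ~7 s in the paper

def text_to_3d(prompt, text_to_image):
    """Text-conditioned generation: a text-to-image model (Hunyuan-DiT in the
    paper) first produces the condition image, unifying both modes."""
    return image_to_3d(text_to_image(prompt))

if __name__ == "__main__":
    print(image_to_3d("chair.png"))
    print(text_to_3d("a wooden chair", lambda p: f"img({p})"))
```

This reflects the abstract's main design choice: by feeding the reconstructor several generated views instead of one, the hard single-view reconstruction problem is relaxed into a better-constrained multi-view one.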