Synthesizing multi-view 3D content from a single image is an important and challenging task. To this end, Zero-1-to-3 methods extend a 2D latent diffusion model to the 3D domain. These approaches generate a target-view image conditioned on a single-view source image and the camera pose. However, the one-to-one mapping adopted in Zero-1-to-3 makes it difficult to build geometric and visual consistency across views, especially for complex objects. To tackle this issue, we propose Cascade-Zero123, a cascade generation framework built from two Zero-1-to-3 models that progressively extracts 3D information from the source image. Specifically, a self-prompting mechanism first generates several nearby views. These views are then fed into the second-stage model, together with the source image, as generation conditions. With the self-prompted views as supplementary information, Cascade-Zero123 generates more consistent novel-view images than Zero-1-to-3. The improvement is significant for a variety of complex and challenging scenes, including insects, humans, transparent objects, and multiple stacked objects. The project page is at https://cascadezero123.github.io/.
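The two-stage data flow described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`cascade_generate`, `toy_stage1`, `toy_stage2`), the pose tuple layout, and the list-of-floats image type are all hypothetical stand-ins, and the toy models only mimic the interface of a pose-conditioned generator rather than running a diffusion model.

```python
from typing import Callable, List, Sequence, Tuple

# Assumed pose layout: (elevation, azimuth, radius) offset relative to the source view.
Pose = Tuple[float, float, float]
# Stand-in for an image tensor; a real model would use an array type.
Image = List[float]
Prompt = Tuple[Pose, Image]


def cascade_generate(
    source: Image,
    target_pose: Pose,
    nearby_poses: Sequence[Pose],
    stage1: Callable[[Image, Pose], Image],
    stage2: Callable[[Image, List[Prompt], Pose], Image],
) -> Image:
    """Hypothetical two-stage cascade: self-prompt nearby views, then generate the target."""
    # Stage 1 (self-prompting): one-to-one generation of several nearby views.
    prompts = [(pose, stage1(source, pose)) for pose in nearby_poses]
    # Stage 2: condition on the source image plus the self-prompted views.
    return stage2(source, prompts, target_pose)


def toy_stage1(source: Image, pose: Pose) -> Image:
    # Toy stand-in: shift every pixel by the azimuth offset.
    return [px + pose[1] for px in source]


def toy_stage2(source: Image, prompts: List[Prompt], target_pose: Pose) -> Image:
    # Toy stand-in: average the source with all prompt views (ignores target_pose).
    views = [source] + [img for _, img in prompts]
    return [sum(col) / len(views) for col in zip(*views)]


if __name__ == "__main__":
    result = cascade_generate(
        source=[0.0, 1.0],
        target_pose=(0.0, 90.0, 0.0),
        nearby_poses=[(0.0, 10.0, 0.0), (0.0, -10.0, 0.0)],
        stage1=toy_stage1,
        stage2=toy_stage2,
    )
    print(result)
```

The point of the sketch is only the structure: the second stage receives the source image together with every first-stage output and its pose, rather than the source image alone as in the one-to-one setting.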