Synthesizing multi-view 3D from a single image is an important but challenging task. Zero-1-to-3 methods have achieved great success by lifting a 2D latent diffusion model to 3D. The target view image is generated with a single-view source image and the camera pose as condition information. However, due to the high sparsity of the single input image, Zero-1-to-3 tends to produce geometry and appearance inconsistencies across views, especially for complex objects. To tackle this issue, we propose to supply more condition information to the generation model, but in a self-prompted way. We construct a cascade framework with two Zero-1-to-3 models, named Cascade-Zero123, which progressively extracts 3D information from the source image. Specifically, several nearby views are first generated by the first model and then fed into the second-stage model, together with the source image, as generation conditions. With these amplified self-prompted condition images, our Cascade-Zero123 generates more consistent novel-view images than Zero-1-to-3. Experimental results demonstrate remarkable improvement, especially for complex and challenging scenes involving insects, humans, transparent objects, and stacked objects, etc. More demos and code are available at https://cascadezero123.github.io.
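The two-stage cascade described above can be sketched as the following minimal pipeline. This is an illustrative outline only: `zero123_generate` is a hypothetical placeholder standing in for a full Zero-1-to-3 latent diffusion model, and the pose format and function names are assumptions, not the authors' actual API.

```python
# Minimal sketch of the Cascade-Zero123 two-stage pipeline.
# zero123_generate is a hypothetical stand-in, NOT the authors' real API:
# a real Zero-1-to-3 model runs latent diffusion conditioned on the
# source image(s) and relative camera pose(s).

def zero123_generate(condition_images, condition_poses, target_pose):
    """Placeholder novel-view generator: records how many condition
    images were supplied and which target pose was requested."""
    return {"pose": target_pose, "num_conditions": len(condition_images)}

def cascade_zero123(source_image, target_pose, nearby_poses):
    # Stage 1: self-prompting -- generate several nearby views
    # from the single source image alone.
    nearby_views = [
        zero123_generate([source_image], [(0, 0)], pose)
        for pose in nearby_poses
    ]
    # Stage 2: condition on the source image PLUS the self-prompted
    # nearby views to synthesize the final target view.
    conditions = [source_image] + nearby_views
    poses = [(0, 0)] + nearby_poses
    return zero123_generate(conditions, poses, target_pose)

result = cascade_zero123(
    "input.png",
    target_pose=(30, 60),                      # (elevation, azimuth), assumed
    nearby_poses=[(10, 0), (0, 10), (-10, 0)],
)
```

The key design point is that the second stage sees a richer condition set (source image plus three generated nearby views, four conditions in total here) than the single image Zero-1-to-3 receives, which is what mitigates the cross-view inconsistency.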