In this paper, we introduce Era3D, a novel multiview diffusion method that generates high-resolution multiview images from a single-view image. Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficiency, and low resolution, resulting in poor-quality multiview images. Specifically, these methods assume that the input images comply with a predefined camera type, e.g., a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails. Moreover, the full-image or dense multiview attention they employ causes computational complexity to grow quadratically with the number of image tokens as resolution increases, resulting in prohibitively expensive training costs. To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module to estimate the focal length and elevation of the input image, which allows our method to generate images without shape distortions. Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion. Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images at up to 512×512 resolution while reducing computational complexity by 12×. Comprehensive experiments demonstrate that Era3D can reconstruct high-quality and detailed 3D meshes from diverse single-view input images, significantly outperforming baseline multiview diffusion methods. Project page: https://penghtyx.github.io/Era3D/.
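To illustrate the complexity argument, the following is a minimal sketch (not the paper's implementation) of row-restricted multiview attention, assuming epipolar lines are aligned with image rows in a canonical orthographic setting; the function name, the absence of learned query/key/value projections, and the plain softmax attention are simplifying assumptions for exposition. Full dense multiview attention mixes all V·H·W tokens at once (cost proportional to (V·H·W)²), whereas attending only within each row reduces this to H groups of (V·W)² interactions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def row_wise_attention(feats):
    """Hypothetical row-wise multiview attention sketch.

    feats: array of shape (V, H, W, C) -- V views, H x W feature maps,
    C channels. Tokens interact only with tokens sharing the same row
    index h across all views, rather than with every token in every
    view, so each attention matrix is (V*W, V*W) instead of
    (V*H*W, V*H*W).
    """
    V, H, W, C = feats.shape
    out = np.empty_like(feats)
    for h in range(H):
        # Gather the V*W tokens on row h across all views.
        tokens = feats[:, h].reshape(V * W, C)
        # Scaled dot-product self-attention within the row group.
        attn = softmax(tokens @ tokens.T / np.sqrt(C))
        out[:, h] = (attn @ tokens).reshape(V, W, C)
    return out
```

Doubling the resolution quadruples the token count per view; dense attention cost then grows by 16×, while the row-restricted variant grows by only 8× (4× more tokens per row group times 2× more rows), which is the kind of saving the abstract refers to.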