Given a single image of a 3D object, this paper proposes a method, named ConsistNet, that generates multiple images of the same object as if captured from different viewpoints, while effectively exploiting the 3D (multi-view) consistency among the generated images. Central to our method is a multi-view consistency block, which enables information exchange across multiple single-view diffusion processes based on the underlying principles of multi-view geometry. ConsistNet is an extension of the standard latent diffusion model (LDM) and consists of two sub-modules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infers consistency, and (b) a ray aggregation module that samples 3D-consistent features and aggregates them back to each view to enforce consistency. Our approach departs from previous methods for multi-view image generation in that it can be easily dropped into pre-trained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU. Our code will be made available at https://github.com/JiayuYANG/ConsistNet
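To make the two sub-modules more concrete, the following is a minimal NumPy sketch of the underlying geometry only. It is not the authors' implementation: ConsistNet operates on latent diffusion features with learned layers, whereas this sketch assumes pinhole cameras, nearest-neighbour sampling, a cubic voxel grid on [-1, 1]^3, and simple mean/variance statistics as a stand-in for "inferring consistency". All function names (`project`, `make_grid`, `view_aggregation`, `ray_aggregation`) and parameters are hypothetical.

```python
import numpy as np

def project(points, K, R, t):
    """Project world points (N, 3) to pixel coords (N, 2) and camera depths (N,)."""
    cam = points @ R.T + t                      # world -> camera frame
    z = cam[:, 2]
    safe_z = np.where(np.abs(z) > 1e-8, z, 1e-8)  # avoid division by zero
    uv = (cam @ K.T)[:, :2] / safe_z[:, None]
    return uv, z

def make_grid(G, half=1.0):
    """Voxel centres of a G^3 grid covering [-half, half]^3, shape (G^3, 3)."""
    axis = (np.arange(G) + 0.5) / G * 2 * half - half
    return np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), -1).reshape(-1, 3)

def view_aggregation(feats, Ks, Rs, ts, grid):
    """Unproject per-view features (V, H, W, C) onto voxel centres (N, 3).

    Returns the per-voxel mean and variance over the views that see each voxel,
    plus the number of such views. Low variance means the views agree.
    """
    V, H, W, C = feats.shape
    N = grid.shape[0]
    total = np.zeros((N, C))
    total_sq = np.zeros((N, C))
    count = np.zeros((N, 1))
    for v in range(V):
        uv, z = project(grid, Ks[v], Rs[v], ts[v])
        px = np.rint(uv).astype(int)            # nearest-neighbour lookup (assumption)
        valid = (z > 0) & (px[:, 0] >= 0) & (px[:, 0] < W) \
                        & (px[:, 1] >= 0) & (px[:, 1] < H)
        f = feats[v, px[valid, 1], px[valid, 0]]
        total[valid] += f
        total_sq[valid] += f ** 2
        count[valid] += 1
    n = np.maximum(count, 1)
    mean = total / n
    var = np.maximum(total_sq / n - mean ** 2, 0.0)
    return mean, var, count[:, 0]

def ray_aggregation(volume, K, R, t, H, W, depths, half=1.0):
    """Sample a (G, G, G, C) volume along each pixel ray and average the samples.

    Returns an (H, W, C) feature map for the view described by K, R, t.
    """
    G = volume.shape[0]
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).astype(float)
    dirs = pix @ np.linalg.inv(K).T             # camera-frame ray directions, z = 1
    out = np.zeros((H * W, volume.shape[-1]))
    hits = np.zeros((H * W, 1))
    for d in depths:
        world = (dirs * d - t) @ R              # inverse of cam = world @ R.T + t
        idx = np.floor((world + half) / (2 * half) * G).astype(int)
        valid = np.all((idx >= 0) & (idx < G), axis=1)
        i = idx[valid]
        out[valid] += volume[i[:, 0], i[:, 1], i[:, 2]]
        hits[valid] += 1
    return (out / np.maximum(hits, 1)).reshape(H, W, -1)
```

In this sketch, features that agree across views yield zero variance in the volume, while conflicting features yield a large variance; the averaged volume can then be rendered back into any view with `ray_aggregation`. Note that neither step needs pixel correspondences or predicted depth, only the camera parameters.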