Recent progress in single-image 3D generation highlights the importance of multi-view coherence, leveraging 3D priors from large-scale diffusion models pretrained on Internet-scale images. However, novel-view diversity remains underexplored, owing to the ambiguity of converting a 2D image into 3D content, where many plausible shapes can emerge. We aim to close this gap by addressing consistency and diversity together. Balancing the two is challenging because they inherently trade off against each other. This work introduces HarmonyView, a simple yet effective diffusion sampling technique that decomposes two intricate aspects of single-image 3D generation: consistency and diversity. This decomposition allows a more nuanced exploration of each dimension within the sampling process. We also propose a new evaluation metric, based on CLIP image and text encoders, that comprehensively assesses the diversity of the generated views and closely aligns with human evaluators' judgments. In experiments, HarmonyView achieves a harmonious balance, with gains in both consistency and diversity.