Recent years have witnessed remarkable progress in 3D content generation, yet the corresponding evaluation methods struggle to keep pace. Automatic approaches have proven difficult to align with human preferences, and the mixed comparison of text- and image-driven methods often leads to unfair evaluations. In this paper, we present a comprehensive framework for aligning and evaluating multi-view diffusion models with human preferences. We first collect and filter a standardized image prompt set from DALL$\cdot$E and Objaverse, then use it to generate multi-view assets with several multi-view diffusion models. Through a systematic ranking pipeline on these assets, we obtain a human annotation dataset of 16k expert pairwise comparisons and train a reward model, coined MVReward, to effectively encode human preferences. With MVReward, image-driven 3D methods can be evaluated against each other in a fairer and more transparent manner. Building on this, we further propose Multi-View Preference Learning (MVP), a plug-and-play multi-view diffusion tuning strategy. Extensive experiments demonstrate that MVReward serves as a reliable metric and that MVP consistently enhances the alignment of multi-view diffusion models with human preferences.