3D generation is experiencing rapid advancements, while the development of 3D evaluation has not kept pace. Keeping automatic evaluation aligned with human perception has therefore become a well-recognized challenge. Recent advances in language and image generation have explored human preference modeling and demonstrated a strong ability to fit human judgments. However, the 3D domain still lacks a comprehensive preference dataset over generative models. To fill this gap, we develop 3DGen-Arena, an integrated platform that evaluates 3D generative models in a pairwise battle manner. We then carefully design diverse text and image prompts and use the arena platform to gather human preferences from both public users and expert annotators, resulting in 3DGen-Bench, a large-scale, multi-dimensional human preference dataset. Using this dataset, we further train a CLIP-based scoring model, 3DGen-Score, and an MLLM-based automatic evaluator, 3DGen-Eval. These two models unify the quality evaluation of text-to-3D and image-to-3D generation and, drawing on their respective strengths, jointly form our automated evaluation system. Extensive experiments demonstrate the efficacy of our scoring model in predicting human preferences, showing a stronger correlation with human rankings than existing metrics. We believe that our 3DGen-Bench dataset and automated evaluation system will foster more equitable evaluation in the field of 3D generation, further promoting the development of 3D generative models and their downstream applications.
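To make the pairwise preference setup concrete, the following is a minimal sketch (not the authors' 3DGen-Score implementation) of how a CLIP-based scorer can be trained on arena-style battle votes. It assumes CLIP embeddings of the prompt and of multi-view renders of each 3D asset are already available; all names, dimensions, and the Bradley-Terry-style loss are illustrative assumptions.

```python
# Hypothetical sketch of a CLIP-feature-based pairwise preference scorer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PairwisePreferenceScorer(nn.Module):
    """Scores a (prompt, 3D-asset) pair from pooled multi-view CLIP features."""

    def __init__(self, embed_dim: int = 512, hidden_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, prompt_emb: torch.Tensor, view_embs: torch.Tensor) -> torch.Tensor:
        # prompt_emb: (B, D) CLIP embedding of the text or image prompt
        # view_embs:  (B, V, D) CLIP embeddings of V rendered views of one asset
        asset_emb = view_embs.mean(dim=1)                  # pool over views
        feats = torch.cat([prompt_emb, asset_emb], dim=-1)
        return self.head(feats).squeeze(-1)                # (B,) scalar score


def bradley_terry_loss(score_win: torch.Tensor, score_lose: torch.Tensor) -> torch.Tensor:
    # Arena votes mark one asset as preferred; the loss pushes its score
    # above the score of the losing asset.
    return -F.logsigmoid(score_win - score_lose).mean()


if __name__ == "__main__":
    B, V, D = 4, 6, 512                                    # batch, views, CLIP dim (assumed)
    scorer = PairwisePreferenceScorer(embed_dim=D)
    prompt = torch.randn(B, D)
    asset_a, asset_b = torch.randn(B, V, D), torch.randn(B, V, D)
    loss = bradley_terry_loss(scorer(prompt, asset_a), scorer(prompt, asset_b))
    loss.backward()
    print(f"pairwise preference loss: {loss.item():.4f}")
```

Because the scorer conditions on either a text or an image prompt embedding in the same feature space, one head of this kind can in principle cover both text-to-3D and image-to-3D comparisons, which mirrors the unified evaluation described above.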