Recent advances in General Text-to-3D (GT23D) have been significant. However, the lack of a benchmark has hindered systematic evaluation and progress, owing to problems in both datasets and metrics: 1) The largest 3D dataset, Objaverse, suffers from missing annotations, disorganization, and low quality. 2) Existing metrics evaluate only text-image alignment and do not consider 3D-level quality. To this end, we present the first comprehensive benchmark for GT23D, called GT23D-Bench, which consists of: 1) a 400k high-fidelity, well-organized 3D dataset that addresses the issues in Objaverse through a systematic annotate-organize-filter pipeline; and 2) comprehensive 3D-aware evaluation metrics, comprising 10 clearly defined metrics that account for multiple dimensions of GT23D. Notably, GT23D-Bench features three properties: 1) Multimodal Annotations. Our dataset annotates each 3D object with 64-view depth maps, normal maps, rendered images, and coarse-to-fine captions. 2) Holistic Evaluation Dimensions. Our metrics are divided into a) Textual-3D Alignment, which measures the alignment of text with multi-granularity visual 3D representations; and b) 3D Visual Quality, which considers texture fidelity, multi-view consistency, and geometry correctness. 3) Valuable Insights. We examine the performance of current GT23D baselines across the different evaluation dimensions and provide an analysis of the results. Extensive experiments demonstrate that our annotations and metrics are aligned with human preferences.