Recent text-to-3D methods leverage powerful pretrained diffusion models to optimize a Neural Radiance Field (NeRF). Notably, these methods can produce high-quality 3D scenes without training on 3D data. Because the task is open-ended, most studies evaluate their results through subjective case studies and user experiments, which makes it difficult to answer quantitatively the question: how far has text-to-3D actually progressed? In this paper, we introduce T$^3$Bench, the first comprehensive text-to-3D benchmark, containing diverse text prompts at three increasing levels of complexity that are specially designed for 3D generation. To assess both subjective quality and text alignment, we propose two automatic metrics based on multi-view images rendered from the generated 3D content. The quality metric combines multi-view text-image scores with regional convolution to assess quality and detect view inconsistency. The alignment metric uses multi-view captioning and Large Language Model (LLM) evaluation to measure text-3D consistency. Both metrics correlate closely with different dimensions of human judgment, providing a paradigm for efficiently evaluating text-to-3D models. The benchmarking results, shown in Fig. 1, reveal performance differences among six prevalent text-to-3D methods. Our analysis further highlights the difficulty current methods have in generating surroundings and multi-object scenes, as well as the bottleneck of leveraging 2D guidance for 3D generation. Our project page is available at: https://t3bench.com.
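To make the idea behind the quality metric more concrete, the following is a minimal sketch of how per-view text-image scores might be pooled with a regional convolution. It is an illustration under stated assumptions, not the benchmark's implementation: the abstract does not specify the view layout, the kernel, or the final pooling step, so the elevation-by-azimuth grid, the mean filter, the max pooling, and the name `regional_quality_score` are all hypothetical choices made here.

```python
import numpy as np


def regional_quality_score(view_scores, kernel=3):
    """Pool per-view scores so that a single lucky view cannot dominate.

    ASSUMPTIONS (not taken from the paper): views lie on a regular
    elevation x azimuth grid, the regional convolution is a plain mean
    filter, and the final score is the maximum of the smoothed grid.
    """
    scores = np.asarray(view_scores, dtype=float)
    if scores.ndim != 2:
        raise ValueError("view_scores must be a 2D (elevation x azimuth) grid")
    if kernel < 1 or kernel % 2 == 0:
        raise ValueError("kernel must be a positive odd integer")

    height, width = scores.shape
    radius = kernel // 2

    # Elevation does not wrap around, so replicate the edge rows.
    padded = np.pad(scores, ((radius, radius), (0, 0)), mode="edge")
    # Azimuth is periodic, so wrap the columns around.
    padded = np.pad(padded, ((0, 0), (radius, radius)), mode="wrap")

    # Mean filter: average each view's score with its neighbouring views.
    smoothed = np.zeros_like(scores)
    for dy in range(kernel):
        for dx in range(kernel):
            smoothed += padded[dy:dy + height, dx:dx + width]
    smoothed /= kernel * kernel

    # Report the best region of views rather than the best single view.
    return float(smoothed.max())


# One excellent view surrounded by poor ones ...
spike = np.zeros((3, 4))
spike[1, 1] = 9.0
# ... versus an object that looks moderately good from every view.
uniform = np.full((3, 4), 2.0)

print(regional_quality_score(spike))
print(regional_quality_score(uniform))
```

In this sketch the view-consistent object receives the higher score even though its best single view is worse, which is the kind of view inconsistency a regional pooling step is meant to penalize. The per-view scores themselves would come from a text-image scoring model, which is left out here.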