While text-conditional 3D object generation and manipulation have seen rapid progress, the evaluation of coherence between generated 3D shapes and input textual descriptions lacks a clear benchmark. The reasons are twofold: (a) the low quality of the textual descriptions in the only publicly available dataset of text-shape pairs, and (b) the limited effectiveness of the metrics used to quantitatively assess such coherence. In this paper, we propose a comprehensive solution that addresses both weaknesses. First, we employ large language models to automatically refine the textual descriptions associated with shapes. Second, we propose a quantitative metric, based on cross-attention mechanisms, to assess text-to-shape coherence. To validate our approach, we conduct a user study and quantitatively compare our metric with existing ones. The refined dataset, the new metric, and a set of text-shape pairs validated by the user study constitute a novel, fine-grained benchmark that we publicly release to foster research on the text-to-shape coherence of text-conditioned 3D generative models. The benchmark is available at https://cvlab-unibo.github.io/CrossCoherence-Web/.