We introduce Cap3D, an automatic approach for generating descriptive text for 3D objects. The approach uses pretrained image captioning, image-text alignment, and large language models (LLMs) to consolidate captions from multiple views of a 3D asset, completely side-stepping the time-consuming and costly process of manual annotation. We apply Cap3D to the recently introduced large-scale 3D dataset Objaverse, resulting in 660k 3D-text pairs. Our evaluation, conducted using 41k human annotations from the same dataset, demonstrates that Cap3D surpasses human-authored descriptions in terms of quality, cost, and speed. Through effective prompt engineering, Cap3D rivals human performance in generating geometric descriptions on 17k collected annotations from the ABO dataset. Finally, we finetune text-to-3D models on Cap3D and human captions and show that Cap3D captions outperform human captions; we also benchmark state-of-the-art methods, including Point-E, Shap-E, and DreamFusion.
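The three-stage flow described above (caption each view, keep the caption that best matches the view, merge the per-view captions with an LLM) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the prompt wording, and the toy stand-ins for the captioner, the alignment scorer, and the LLM are all hypothetical, and real pretrained models would be plugged in through the three callables.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces; a "view" here is any object the callables understand
# (for example a rendered image). None of these names come from the paper.
Captioner = Callable[[object], List[str]]   # view -> candidate captions
Scorer = Callable[[object, str], float]     # (view, caption) -> alignment score
Summarizer = Callable[[str], str]           # prompt -> consolidated caption


def select_best_caption(view: object, candidates: Sequence[str], score: Scorer) -> str:
    """Keep the candidate caption with the highest image-text alignment score."""
    return max(candidates, key=lambda caption: score(view, caption))


def consolidate_captions(
    views: Sequence[object],
    caption_view: Captioner,
    score: Scorer,
    summarize: Summarizer,
) -> str:
    """Caption every view, filter per view, then merge the results with an LLM."""
    per_view = [
        select_best_caption(view, caption_view(view), score) for view in views
    ]
    # Assumed prompt wording, for illustration only.
    prompt = (
        "Given these descriptions of one 3D object from different views, "
        "write one concise caption:\n"
        + "\n".join(f"- {caption}" for caption in per_view)
    )
    return summarize(prompt)


# Toy stand-ins so the sketch runs without any pretrained model.
def toy_captioner(view: str) -> List[str]:
    return [f"an object seen from the {view}", f"a chair seen from the {view}"]


def toy_score(view: str, caption: str) -> float:
    # Pretend alignment model: prefer captions that name the object.
    return 1.0 if "chair" in caption else 0.0


def toy_summarizer(prompt: str) -> str:
    # Pretend LLM: return the first bulleted description in the prompt.
    for line in prompt.splitlines():
        if line.startswith("- "):
            return line[2:]
    return ""


caption = consolidate_captions(
    ["front", "side"], toy_captioner, toy_score, toy_summarizer
)
print(caption)
```

Passing the models in as callables keeps the sketch independent of any particular captioning, alignment, or language model.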