Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes generate hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views and captions them using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views; views with high alignment closely represent the object's characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.
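The view-selection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a per-view alignment score has already been computed (the abstract does not specify how the text-to-3D model produces it) and that a higher score means better alignment. The names `top_k_views`, `alignment_scores`, and `k` are hypothetical.

```python
from typing import List, Sequence


def top_k_views(alignment_scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k views with the highest alignment scores.

    Assumption: alignment_scores[i] is a precomputed score for rendered
    view i, where a higher value means the view better represents the object.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    # Sort view indices by score, highest first. Python's sort is stable,
    # so views with equal scores keep their original (lower-index-first) order.
    ranked = sorted(
        range(len(alignment_scores)),
        key=lambda i: alignment_scores[i],
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical scores for four rendered views of one object.
    scores = [0.1, 0.9, 0.5, 0.7]
    print(top_k_views(scores, k=2))
```

Only the selected views would then be passed to the captioning model, so atypical views with low alignment are excluded before captioning.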