Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible, training-free pipeline that generates high-fidelity, detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) uses tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-check results. In this step, VFC can flexibly generate captions in various styles, following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for the image-image similarity between the original image and a reconstruction generated by a text-to-image model from the caption; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods on 2D images from the COCO dataset and on 3D assets from the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
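The three-step pipeline above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the captioner, detector, VQA, and LLM callables are hypothetical stand-ins for the actual models, and the prompt wording is invented.

```python
def propose_captions(image, captioners):
    """Step 1 (proposal): each image-to-text model proposes an initial caption."""
    return [captioner(image) for captioner in captioners]

def verify_captions(image, captions, detect, vqa):
    """Step 2 (verification): fact-check each caption with detection/VQA tools."""
    checks = []
    for caption in captions:
        objects = detect(image)  # tool call: object detection on the image
        # tool call: VQA queries probing whether claimed objects are present
        answers = [vqa(image, f"Is there a {obj} in the image?") for obj in objects]
        checks.append({"caption": caption, "objects": objects, "answers": answers})
    return checks

def summarize(llm, proposals, checks, instruction):
    """Step 3 (captioning): an LLM writes the final caption from proposals + checks."""
    prompt = (f"Instruction: {instruction}\n"
              f"Caption proposals: {proposals}\n"
              f"Fact-check results: {checks}\n"
              "Write one faithful, detailed caption.")
    return llm(prompt)

def visual_fact_checker(image, captioners, detect, vqa, llm, instruction):
    """End-to-end training-free pipeline: proposal -> verification -> captioning."""
    proposals = propose_captions(image, captioners)
    checks = verify_captions(image, proposals, detect, vqa)
    return summarize(llm, proposals, checks, instruction)
```

Because every component is an interchangeable callable, the same skeleton accommodates different captioning models, tools, or instruction styles without any training.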
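The CLIP-Image-Score idea can be sketched as below: reconstruct an image from the caption with a text-to-image model, embed both images, and score their cosine similarity. The `embed_image` and `text_to_image` callables are hypothetical placeholders for a real CLIP image encoder and a diffusion model; this is an assumption-laden sketch, not the paper's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def clip_image_score(embed_image, text_to_image, original_image, caption):
    """Image-image similarity between the original image and the image
    reconstructed from the caption (placeholder models, see lead-in)."""
    reconstructed = text_to_image(caption)
    return cosine_similarity(embed_image(original_image),
                             embed_image(reconstructed))
```

Unlike CLIP-Score, which compares a caption to an image directly, this metric compares two images, so a caption that omits or hallucinates content is penalized through the reconstruction.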