Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible, training-free pipeline that generates high-fidelity, detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) uses tools such as object detection and VQA models to fact-check the proposed captions; 3) captioning, where an LLM generates the final caption by summarizing the caption proposals and the fact-check results. In this step, VFC can flexibly generate captions in various styles, following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for the image-image similarity between the original image and a reconstruction generated by a text-to-image model from the caption; 3) a human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-source captioning methods on 2D images from the COCO dataset and on 3D assets from the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
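The three-step pipeline above can be sketched as follows. This is a minimal illustrative skeleton, not the paper's implementation: the captioner, detector, VQA, and LLM callables are hypothetical stand-ins for the actual models, and the prompt wording is invented.

```python
def propose_captions(image, captioners):
    """Step 1 (proposal): each image-to-text model proposes an initial caption."""
    return [captioner(image) for captioner in captioners]

def verify_captions(image, captions, detect, vqa):
    """Step 2 (verification): fact-check each caption with detection/VQA tools."""
    checks = []
    for caption in captions:
        objects = detect(image)  # tool call: object detection on the image
        # tool call: VQA queries probing whether claimed objects are present
        answers = [vqa(image, f"Is there a {obj} in the image?") for obj in objects]
        checks.append({"caption": caption, "objects": objects, "answers": answers})
    return checks

def summarize(llm, proposals, checks, instruction):
    """Step 3 (captioning): an LLM writes the final caption from proposals + checks."""
    prompt = (f"Instruction: {instruction}\n"
              f"Caption proposals: {proposals}\n"
              f"Fact-check results: {checks}\n"
              "Write one faithful, detailed caption.")
    return llm(prompt)

def visual_fact_checker(image, captioners, detect, vqa, llm, instruction):
    """End-to-end training-free pipeline: proposal -> verification -> captioning."""
    proposals = propose_captions(image, captioners)
    checks = verify_captions(image, proposals, detect, vqa)
    return summarize(llm, proposals, checks, instruction)
```

Because every component is an interchangeable callable, the same skeleton accommodates different captioning models, tools, or instruction styles without any training.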
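The CLIP-Image-Score idea can be sketched as below: reconstruct an image from the caption with a text-to-image model, embed both images, and score their cosine similarity. The `embed_image` and `text_to_image` callables are hypothetical placeholders for a real CLIP image encoder and a diffusion model; this is an assumption-laden sketch, not the paper's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def clip_image_score(embed_image, text_to_image, original_image, caption):
    """Image-image similarity between the original image and the image
    reconstructed from the caption (placeholder models, see lead-in)."""
    reconstructed = text_to_image(caption)
    return cosine_similarity(embed_image(original_image),
                             embed_image(reconstructed))
```

Unlike CLIP-Score, which compares a caption to an image directly, this metric compares two images, so a caption that omits or hallucinates content is penalized through the reconstruction.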