Benchmarking and Improving Detail Image Caption

Image captioning has long been regarded as a fundamental task in visual understanding. Recently, however, few studies of large vision-language models (LVLMs) discuss image captioning performance, because the existing short-caption benchmarks are outdated and the evaluation metrics are unreliable. In this work, we propose to benchmark the detailed image captioning task by curating high-quality evaluation datasets annotated by human experts, GPT-4V and Gemini-1.5-Pro. We also design a more reliable caption evaluation metric called CAPTURE (CAPtion evaluation by exTracting and coUpling coRE information). CAPTURE extracts visual elements (e.g., objects, attributes and relations) from captions and then matches these elements through three stages, achieving higher consistency with expert judgements than other rule-based or model-based caption metrics. The proposed benchmark and metric provide reliable evaluation of LVLMs' detailed image captioning ability. Guided by this evaluation, we further explore how to unleash LVLMs' detailed captioning capabilities by synthesizing high-quality data through a five-stage data construction pipeline. Our pipeline uses only the given LVLM itself and other open-source tools, without any human or GPT-4V annotation in the loop. Experiments show that the proposed data construction strategy significantly improves the quality of model-generated detailed caption data for LVLMs with leading performance, and that the data quality can be further improved in a self-looping paradigm. All code and datasets will be publicly available at https://github.com/foundation-multimodal-models/CAPTURE.
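
To make the idea of staged element matching concrete, the following is a minimal sketch and not the released CAPTURE implementation. The abstract does not name the three stages, so the ones used here (exact match, synonym match, soft string similarity) are assumptions chosen for illustration. The `SYNONYMS` table, the `soft_threshold` value, the function names, and the use of `difflib` as the similarity measure are all hypothetical stand-ins, and the elements are assumed to have already been extracted from the captions.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table (assumption); a real system would likely
# rely on a lexical resource rather than a hand-written dictionary.
SYNONYMS = {"puppy": "dog", "sofa": "couch"}


def normalize(element):
    """Lowercase and trim an extracted element string."""
    return element.strip().lower()


def match_elements(candidate, reference, soft_threshold=0.8):
    """Count candidate elements matched to reference elements.

    Each reference element can be consumed by at most one candidate.
    The three stages are illustrative assumptions, not the paper's spec.
    """
    cand = [normalize(e) for e in candidate]
    ref = [normalize(e) for e in reference]
    matched = 0

    # Stage 1 (assumed): exact string match.
    unmatched = []
    for c in cand:
        if c in ref:
            ref.remove(c)
            matched += 1
        else:
            unmatched.append(c)

    # Stage 2 (assumed): match after mapping both sides to a canonical synonym.
    still_unmatched = []
    for c in unmatched:
        canon = SYNONYMS.get(c, c)
        hit = next((r for r in ref if SYNONYMS.get(r, r) == canon), None)
        if hit is not None:
            ref.remove(hit)
            matched += 1
        else:
            still_unmatched.append(c)

    # Stage 3 (assumed): soft match via character-level similarity.
    for c in still_unmatched:
        best = max(ref, key=lambda r: SequenceMatcher(None, c, r).ratio(),
                   default=None)
        if best is not None and SequenceMatcher(None, c, best).ratio() >= soft_threshold:
            ref.remove(best)
            matched += 1

    return matched


def element_f1(candidate, reference):
    """F1 score over matched elements; 0.0 if either side is empty."""
    if not candidate or not reference:
        return 0.0
    m = match_elements(candidate, reference)
    if m == 0:
        return 0.0
    precision = m / len(candidate)
    recall = m / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For example, `element_f1(["Dog", "sofa", "window"], ["dog", "couch", "windows", "lamp"])` matches one element at each stage in this sketch. Scoring objects, attributes and relations separately and then combining the scores would be a natural extension, but how CAPTURE weights them is not stated in the abstract.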
