Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions

Image description datasets play a crucial role in advancing applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. One is the scraping of image-text pairs from the web. Despite their abundance, these descriptions are often noisy and of low quality. The other is human labeling. Descriptions in datasets such as COCO are generally very short and lack detail. Although detailed image descriptions can be annotated by humans, the high annotation cost limits feasibility. These limitations underscore the need for more efficient and scalable methods to generate accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, maximally converting the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verify the quality of image descriptions created by our framework. Furthermore, we show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions, substantially increasing the length and detail of its output with less hallucination.

