Image description datasets play a crucial role in advancing applications such as image understanding, text-to-image generation, and text-image retrieval. Currently, image description datasets primarily originate from two sources. The first is scraping image-text pairs from the web; despite their abundance, these descriptions are often noisy and of low quality. The second is human labeling. Descriptions in human-labeled datasets such as COCO are generally very short and lack detail, and although humans can annotate detailed image descriptions, the high annotation cost limits feasibility. These limitations underscore the need for more efficient and scalable methods of generating accurate and detailed image descriptions. In this paper, we propose an innovative framework termed Image Textualization (IT), which automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models in a collaborative manner, maximally converting the visual information into text. To address the current lack of benchmarks for detailed descriptions, we propose several benchmarks for comprehensive evaluation, which verify the quality of the image descriptions created by our framework. Furthermore, we show that LLaVA-7B, when trained on IT-curated descriptions, acquires an improved capability to generate richer image descriptions, substantially increasing the length and detail of its output while hallucinating less.