Text-only Image Captioning (TIC) aims to build a model that can accurately describe images while being trained on text alone. Recently, diffusion models have demonstrated remarkable capabilities in generating high-quality images that are semantically coherent with given texts, which presents an opportunity to generate synthetic training images for TIC. However, we identify a challenge: images generated from simple descriptions typically exhibit a single perspective with one or only a few contexts, which does not match the complexity of real-world scenes in the image domain. In this paper, we propose a novel framework that addresses this issue by introducing multi-context data generation. Starting from an initial text corpus, our framework employs a large language model to select multiple sentences that describe the same scene from various perspectives. These sentences are then summarized into a single sentence with multiple contexts. Using diffusion models, we generate simple images from the original straightforward sentences and complex images from the summarized sentences. Finally, we train the model exclusively on the synthetic image-text pairs obtained from this process. Experimental results demonstrate that our proposed framework effectively tackles the central challenge we have identified, achieving state-of-the-art performance on popular datasets such as MSCOCO, Flickr30k, and SS1M.
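The data-generation pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `build_multi_context_pairs`, the `Pair` record, and the three toy stand-ins (`toy_select`, `toy_summarize`, `toy_text_to_image`) are all hypothetical names. In a real system, the selection and summarization steps would be calls to a large language model and the image step would be a call to a diffusion model. Pairing each complex image with its summarized sentence is also an assumption, since the abstract does not specify how captions are assigned.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

STOPWORDS = {"a", "an", "the", "on", "in", "of"}


@dataclass
class Pair:
    image: Any      # whatever the image generator returns
    caption: str    # text paired with the image (assumed pairing)
    kind: str       # "simple" or "complex"


def toy_select(anchor: str, corpus: Sequence[str], k: int) -> List[str]:
    """Toy stand-in for LLM selection: rank other sentences by word overlap."""
    words = set(anchor.lower().split()) - STOPWORDS
    scored = []
    for s in corpus:
        if s == anchor:
            continue
        overlap = len(words & (set(s.lower().split()) - STOPWORDS))
        if overlap > 0:
            scored.append((overlap, s))
    scored.sort(key=lambda t: -t[0])  # stable sort keeps corpus order on ties
    return [anchor] + [s for _, s in scored[: k - 1]]


def toy_summarize(group: Sequence[str]) -> str:
    """Toy stand-in for LLM summarization: join sentences into one."""
    return " and ".join(group)


def toy_text_to_image(prompt: str) -> str:
    """Toy stand-in for a diffusion model: returns a placeholder string."""
    return f"<image:{prompt}>"


def build_multi_context_pairs(
    corpus: Sequence[str],
    select: Callable[[str, Sequence[str], int], List[str]] = toy_select,
    summarize: Callable[[Sequence[str]], str] = toy_summarize,
    text_to_image: Callable[[str], Any] = toy_text_to_image,
    group_size: int = 3,
) -> List[Pair]:
    pairs: List[Pair] = []

    # Simple images: one per original sentence.
    for sentence in corpus:
        pairs.append(Pair(text_to_image(sentence), sentence, "simple"))

    # Complex images: one per distinct group of same-scene sentences.
    seen = set()
    for anchor in corpus:
        group = select(anchor, corpus, group_size)
        key = frozenset(group)
        if len(group) < 2 or key in seen:
            continue  # skip singletons and groups already summarized
        seen.add(key)
        summary = summarize(group)
        pairs.append(Pair(text_to_image(summary), summary, "complex"))

    return pairs
```

For example, calling `build_multi_context_pairs(["a dog runs on the beach", "a dog chases a ball", "a red bus on a street"])` yields one simple pair per sentence plus one complex pair built from the two dog sentences. The resulting list is what a captioning model would then be trained on.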