While deep-learning models have been shown to perform well on image-to-text datasets, they are difficult to use in practice for captioning images. This is because captions are traditionally context-dependent and offer complementary information about an image, whereas models tend to produce descriptions of the image's visual features. Prior research in caption generation has explored models that generate captions when given an image alongside its description or context. We propose and evaluate a new approach that leverages existing large language models to generate captions from textual descriptions and context alone, without ever processing the image directly. We demonstrate that, after fine-tuning, our approach outperforms current state-of-the-art image-text alignment models such as OSCAR-VinVL on this task, as measured by the CIDEr metric.