Image captioning bridges vision and language by automatically generating natural language descriptions for images. Traditional captioning methods, however, overlook the preferences and characteristics of individual users. Personalized image captioning addresses this by incorporating user prior knowledge, such as writing style and preferred vocabulary, into the model. Most existing methods focus on fusing user context through memory networks or transformers, but they ignore the distinct domains of each dataset. As a result, they must update all parameters of the captioning model when encountering new samples, which is time-consuming and computationally intensive. To address this challenge, we propose a novel personalized image captioning framework that leverages user context to capture personality factors. Our framework adopts the prefix-tuning paradigm to extract knowledge from a frozen large language model, narrowing the gap between different language domains. Specifically, we employ CLIP to extract visual features from an image and align the semantic spaces with a query-guided mapping network. A transformer layer then merges the visual features with the user's contextual prior knowledge to generate informative prefixes, which are fed to a frozen GPT-2 language model. With only a small number of trainable parameters, our model is both efficient and effective: it outperforms existing baselines on the Instagram and YFCC100M datasets across five evaluation metrics, including twofold improvements on metrics such as BLEU-4 and CIDEr.
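The prefix-construction pipeline described above (CLIP feature → query-guided mapping network → fusion with user context → prefix for the frozen GPT-2) can be sketched in terms of tensor shapes. This is a minimal illustration, not the actual implementation: the dimensions (512 for a CLIP ViT-B/32 image embedding, 768 for GPT-2 token embeddings) and the token counts are assumptions, and the learned mapping network and transformer fusion layer are stood in for by a single random linear projection and plain concatenation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the paper): CLIP ViT-B/32
# image embeddings are 512-d; GPT-2 token embeddings are 768-d.
CLIP_DIM, GPT2_DIM = 512, 768
N_PREFIX = 10   # number of image-derived pseudo-tokens (assumed)
N_CONTEXT = 5   # number of user-context tokens encoding style/vocabulary (assumed)

# Stand-in for the learned parameters of the query-guided mapping network.
W_map = rng.standard_normal((CLIP_DIM, N_PREFIX * GPT2_DIM)) * 0.02

def mapping_network(clip_feat):
    """Project a CLIP image feature into N_PREFIX pseudo-token embeddings
    living in the frozen LM's embedding space."""
    return (clip_feat @ W_map).reshape(N_PREFIX, GPT2_DIM)

def build_prefix(clip_feat, user_context):
    """Fuse image-derived tokens with user-context tokens.
    In the framework this fusion is done by a transformer layer; plain
    concatenation here only illustrates the resulting prefix shape."""
    return np.concatenate([mapping_network(clip_feat), user_context], axis=0)

clip_feat = rng.standard_normal(CLIP_DIM)                  # frozen CLIP output
user_context = rng.standard_normal((N_CONTEXT, GPT2_DIM))  # user prior embeddings
prefix = build_prefix(clip_feat, user_context)
print(prefix.shape)  # (15, 768)
```

The resulting `(N_PREFIX + N_CONTEXT, GPT2_DIM)` prefix would be prepended to the caption's token embeddings before the frozen GPT-2 forward pass, so that only the mapping/fusion parameters require gradient updates.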