Recent multi-modal large language models (MLLMs) often struggle to generate personalized image captions, even when trained on high-quality captions. In this work, we observe that such limitations persist in existing post-training-based MLLM personalization methods. Specifically, despite being post-tuned with large-scale caption data through supervised fine-tuning (SFT), these models frequently fail to produce faithful descriptions in real-world scenarios, such as multi-concept image captioning. However, acquiring large-scale, high-quality captions for such complex settings is both costly and difficult. To address SFT's inherent dependence on such data, we propose a reinforcement learning (RL)-based post-training framework. To the best of our knowledge, this is the first RL-based approach to post-train MLLMs for personalized image captioning. Our method significantly enhances both the visual recognition and personalized generation capabilities of MLLMs, and consistently outperforms existing SFT-based baselines, especially on the challenging multi-concept image captioning task. Project page: https://github.com/oyt9306/RePIC