The conventional training recipe for image captioning pre-trains a network with teacher forcing and then fine-tunes it with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when optimizing modern, higher-quality metrics such as CLIP-Score and PAC-Score, this recipe is often unstable and fails to acquire the genuine descriptive capability needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model distilled from a learnable captioning evaluator that correlates highly with human judgment, by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, so fluency is preserved. Compared to existing methods, DiCO is more stable, generates higher-quality captions, and aligns more closely with human preferences, especially on modern metrics, while remaining competitive on traditional ones. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.
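To make the idea of "solving a weighted classification problem directly inside the captioner" concrete, the following is a minimal, hypothetical sketch in plain Python. It is not the paper's implementation: the function name `dico_like_loss`, the softmax-over-rewards targets, and the use of policy/reference log-probability margins are assumptions, illustrating one DPO-style way such an objective could look. Rewards would come from a CLIP-based evaluator; the margin against the reference model is what discourages divergence from the original captioner.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dico_like_loss(logp_policy, logp_ref, rewards, beta=1.0, tau=1.0):
    """Hypothetical weighted classification over K sampled captions.

    Soft targets are a softmax over CLIP-based rewards (temperature tau);
    the "logits" are the policy-vs-reference log-probability margins,
    scaled by beta. Tying the loss to the reference model's log-probs
    penalizes drifting away from the original captioner.
    """
    targets = softmax([r / tau for r in rewards])            # reward-derived soft labels
    margins = [beta * (p - q) for p, q in zip(logp_policy, logp_ref)]
    log_z = max(margins) + math.log(
        sum(math.exp(m - max(margins)) for m in margins))    # stable logsumexp
    log_probs = [m - log_z for m in margins]                 # log-softmax of margins
    return -sum(t * lp for t, lp in zip(targets, log_probs)) # cross-entropy
```

As a sanity check, a policy whose caption log-probabilities are ordered like the rewards yields a lower loss than one ordered against them, e.g. `dico_like_loss([-1.0, -2.0, -3.0], [-2.0]*3, [1.0, 0.5, 0.0])` is smaller than `dico_like_loss([-3.0, -2.0, -1.0], [-2.0]*3, [1.0, 0.5, 0.0])`.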