Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however, they lack sufficient capabilities to handle hallucinations and to generalize across diverse images and texts, partly because they compute scalar similarities using only embeddings learned from tasks unrelated to image captioning evaluation. In this study, we propose Polos, a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs, using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos, we introduce Multimodal Metric Learning from Human Feedback (M$^2$LHF), a framework for developing metrics based on human feedback. We constructed the Polaris dataset, which comprises 131K human judgments from 550 evaluators and is approximately ten times larger than standard datasets. Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL, and the Polaris dataset, thereby demonstrating its effectiveness and robustness.