This study examines two limitations of learned image captioning evaluation metrics: the lack of granular assessment of errors within captions, and the reliance on single-point quality estimates that ignore uncertainty. To address these limitations, we propose a simple yet effective strategy for generating and calibrating distributions of CLIPScore values. Leveraging a model-agnostic conformal risk control framework, we calibrate CLIPScore distributions against task-specific control variables. Experimental results demonstrate that applying conformal risk control to score distributions produced with simple methods, such as input masking, achieves performance competitive with more complex approaches. Our method effectively detects erroneous words while providing formal guarantees aligned with desired risk levels, and it improves the correlation between uncertainty estimates and prediction errors, thus enhancing the overall reliability of caption evaluation metrics.
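To make the score-distribution step concrete, below is a minimal sketch, assuming the standard CLIPScore formulation of Hessel et al. (2021) computed with the Hugging Face `openai/clip-vit-base-patch32` checkpoint; the function names and the leave-one-word-out masking scheme are illustrative assumptions, not necessarily the paper's exact masking procedure.

```python
# Sketch: generate a distribution of CLIPScore values by masking one word
# of the candidate caption at a time. Model checkpoint, function names, and
# the drop-one-word masking scheme are illustrative assumptions.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image, caption: str) -> float:
    """CLIPScore as in Hessel et al. (2021): 2.5 * max(cosine similarity, 0).
    `image` is a PIL image."""
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    cos = (img * txt).sum(dim=-1).item()
    return 2.5 * max(cos, 0.0)

def masked_score_distribution(image, caption: str) -> list[float]:
    """One CLIPScore per word-masked variant of the caption; the spread of
    these scores serves as a simple uncertainty signal per word."""
    words = caption.split()
    scores = []
    for i in range(len(words)):
        masked = " ".join(words[:i] + words[i + 1:])  # drop word i
        scores.append(clip_score(image, masked))
    return scores
```

A word whose removal raises the score markedly is a candidate erroneous word; the calibration step below turns that heuristic into a decision rule with a formal risk guarantee.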
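The calibration step can be sketched as follows, assuming the standard conformal risk control recipe (Angelopoulos et al., 2022) of selecting the smallest threshold whose inflated empirical risk on a held-out calibration set stays below a target level; the loss matrix and synthetic data here are placeholders, not the paper's experimental setup.

```python
# Sketch of conformal risk control: choose the smallest threshold lambda
# such that (n * Rhat(lambda) + B) / (n + 1) <= alpha on a calibration set,
# where losses lie in [0, B] and are non-increasing in lambda.
import numpy as np

def calibrate_lambda(losses_at_lambda: np.ndarray, lambdas: np.ndarray,
                     alpha: float, B: float = 1.0) -> float:
    """losses_at_lambda: shape (n_calib, n_lambdas), entries in [0, B],
    non-increasing along the (ascending) lambda axis."""
    n = losses_at_lambda.shape[0]
    rhat = losses_at_lambda.mean(axis=0)           # empirical risk per lambda
    valid = (n * rhat + B) / (n + 1) <= alpha      # inflated-risk condition
    if not valid.any():
        raise ValueError("No lambda meets the target risk level.")
    return float(lambdas[valid.argmax()])          # first (smallest) valid lambda

# Toy usage with synthetic monotone losses: each calibration point incurs
# loss 1 until lambda exceeds its latent score, then loss 0.
rng = np.random.default_rng(0)
lambdas = np.linspace(0.0, 1.0, 101)
base = rng.uniform(0.0, 1.0, size=(500, 1))
losses = (base > lambdas).astype(float)
lam_hat = calibrate_lambda(losses, lambdas, alpha=0.1)
print(f"calibrated lambda: {lam_hat:.2f}")
```

Because the guarantee only requires losses bounded in [0, B] and monotone in the threshold, the same calibration routine applies to any loss defined over the masked-score distributions, such as the rate of missed erroneous words.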