Model-based evaluation metrics (e.g., CLIPScore and GPTScore) correlate reasonably well with human judgments across a range of language generation tasks. However, their impact on fairness remains largely unexplored. Pretrained models are widely known to encode societal biases, so using them for evaluation may perpetuate and amplify those biases. For example, an evaluation metric may favor the caption "a woman is calculating an account book" over "a man is calculating an account book," even if the image shows only male accountants. In this paper, we conduct a systematic study of gender biases in model-based automatic evaluation metrics for image captioning. We start by curating a dataset of profession, activity, and object concepts that carry stereotypical gender associations. We then demonstrate the negative consequences of using these biased metrics, including the inability to differentiate between biased and unbiased generations and the propagation of biases to generation models through reinforcement learning. Finally, we present a simple and effective way to mitigate the metric bias without hurting correlations with human judgments. Our dataset and framework lay the foundation for understanding the potential harm of model-based evaluation metrics and facilitate future work on more inclusive evaluation metrics.
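
To make the account-book example concrete, the following minimal sketch compares the scores a CLIPScore-style metric assigns to two captions that differ only in the gendered word. It assumes the standard CLIPScore form, w * max(cos(image, caption), 0) with w = 2.5. The helper names (`cosine`, `clip_score`, `gender_gap`) and the three-dimensional toy vectors are hypothetical stand-ins for real CLIP embeddings, and the gap computed here is only an illustration, not necessarily the bias measure used in the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clip_score(image_emb, caption_emb, w=2.5):
    """CLIPScore-style score: rescaled cosine similarity, clipped at zero."""
    return w * max(cosine(image_emb, caption_emb), 0.0)


def gender_gap(image_emb, woman_caption_emb, man_caption_emb):
    """Score of the 'woman' caption minus score of the 'man' caption.

    For an image that shows only men, a positive gap means the metric
    prefers the incorrect, gender-swapped caption.
    """
    return clip_score(image_emb, woman_caption_emb) - clip_score(
        image_emb, man_caption_emb
    )


# Toy embeddings (assumed values, NOT real CLIP outputs).
image = [0.6, 0.8, 0.0]          # image of male accountants
woman_caption = [0.6, 0.8, 0.0]  # "a woman is calculating an account book"
man_caption = [0.8, 0.6, 0.0]    # "a man is calculating an account book"

gap = gender_gap(image, woman_caption, man_caption)
print(f"score gap (woman - man): {gap:.3f}")
```

With real embeddings, averaging this gap over many images whose depicted gender is known would give a rough indication of whether a metric systematically favors one gendered caption.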