Large Language Models (LLMs) are increasingly used to autonomously evaluate the quality of content in communication systems, e.g., to assess responses from telecom customer-support chatbots. However, the impartiality of these AI "judges" is not guaranteed, and any biases in their evaluation criteria could skew outcomes and undermine user trust. In this paper, we systematically investigate judgment biases across six LLM-as-a-judge models, spanning both prompt-based and fine-tuned judges under the pointwise scoring setting and covering 11 bias types in both implicit and explicit forms. We observe that state-of-the-art LLM judges are robust to biased inputs, generally assigning them lower scores than the corresponding clean samples. We further find that fine-tuning an LLM on high-scoring yet biased responses can significantly degrade its performance, highlighting the risk of training on biased data. We also find that judged scores correlate with task difficulty: a challenging dataset such as GPQA yields lower average scores, whereas an open-ended reasoning dataset (e.g., JudgeLM-val) yields higher ones. Finally, we propose four mitigation strategies to help ensure fair and reliable AI judging in practical communication scenarios.