Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns about demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using two metrics: the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals across counterfactual pairs, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across those pairs. Evaluating 18 LLMs on 3,000 real-world contact-center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming with historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent source of bias. Finally, we analyze the efficacy of fairness-aware prompting and find that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines before LLMs are deployed in high-stakes workforce evaluation.
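As an illustration of the two metrics named in the abstract, the following is a minimal sketch of how CFR and MASD could be computed over counterfactual pairs. It assumes each pair holds one binary judgment and one numeric score for the original transcript and for its counterfactual variant; the `PairResult` structure, its field names, and both function names are hypothetical and are not taken from the paper, which may aggregate or normalize differently.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class PairResult:
    """Model outputs for one counterfactual pair (hypothetical structure)."""
    original_judgment: bool          # binary QA judgment on the original transcript
    counterfactual_judgment: bool    # same judgment on the perturbed transcript
    original_score: float            # e.g. a confidence or coaching score
    counterfactual_score: float


def counterfactual_flip_rate(pairs: Sequence[PairResult]) -> float:
    """Fraction of pairs whose binary judgment reverses under the perturbation."""
    if not pairs:
        raise ValueError("at least one counterfactual pair is required")
    flips = sum(p.original_judgment != p.counterfactual_judgment for p in pairs)
    return flips / len(pairs)


def mean_absolute_score_difference(pairs: Sequence[PairResult]) -> float:
    """Average absolute shift in score between original and counterfactual."""
    if not pairs:
        raise ValueError("at least one counterfactual pair is required")
    total = sum(abs(p.original_score - p.counterfactual_score) for p in pairs)
    return total / len(pairs)


# Toy data, invented purely for illustration (not results from the paper).
example_pairs = [
    PairResult(True, True, 4.0, 4.0),
    PairResult(True, False, 4.0, 3.5),
    PairResult(False, False, 2.0, 3.0),
    PairResult(True, True, 3.5, 3.0),
]

cfr = counterfactual_flip_rate(example_pairs)
masd = mean_absolute_score_difference(example_pairs)
print(f"CFR = {cfr:.2%}, MASD = {masd:.2f}")
```

Under these assumptions, a CFR of 0 and a MASD of 0 would mean the perturbation never changed the evaluation; computing both per fairness dimension would allow the kind of per-dimension comparison the abstract describes.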