Large language models (LLMs) have shown promise as automated evaluators for assessing the quality of answers generated by AI systems. However, these LLM-based evaluators exhibit position bias, or inconsistency, when comparing candidate answers pairwise: they favor either the first or the second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system designed to mimic human comparison strategies and calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits the answers into multiple segments, aligns similar content across candidate answers, and then merges them back into a single prompt for evaluation by LLMs. We conducted extensive experiments with six diverse LLMs to evaluate 11,520 answer pairs. Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested, achieving an average relative improvement of 47.46%. Remarkably, PORTIA enables less advanced GPT models to achieve 88% agreement with the state-of-the-art GPT-4 model at just 10% of the cost. Furthermore, it rectifies around 80% of the position bias instances within the GPT-4 model, raising its consistency rate to as much as 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass the standalone GPT-4 in terms of alignment with human evaluators. These findings highlight PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while remaining cost-efficient. This represents a valuable step toward a more reliable and scalable use of LLMs for automated evaluations across diverse applications.
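The split, align, and merge steps, together with the notion of a consistent verdict, can be illustrated with a minimal sketch. It is not the PORTIA implementation: the sentence-level splitting, the Jaccard token-overlap similarity, the brute-force search over cut points, the prompt labels, and every function name below are assumptions chosen for illustration, and the abstract does not specify how PORTIA performs any of these steps. The `judge` callable is a hypothetical stand-in for an LLM evaluator that returns `"first"`, `"second"`, or `"tie"`.

```python
import re
from itertools import combinations
from typing import Callable, List, Sequence


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' followed by whitespace (a crude heuristic)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def cut(sentences: List[str], cuts: Sequence[int]) -> List[str]:
    """Join sentences into contiguous segments delimited by the given cut indices."""
    bounds = [0, *cuts, len(sentences)]
    return [" ".join(sentences[i:j]) for i, j in zip(bounds, bounds[1:])]


def even_segments(text: str, k: int) -> List[str]:
    """Split text into k segments with roughly equal sentence counts."""
    sentences = split_sentences(text)
    return cut(sentences, [i * len(sentences) // k for i in range(1, k)])


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; an assumed stand-in for a real similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def aligned_segments(reference: List[str], text: str) -> List[str]:
    """Choose cut points for `text` that maximise total similarity to `reference`.

    Brute force over all cut combinations, so only suitable for short answers.
    """
    k = len(reference)
    sentences = split_sentences(text)
    if len(sentences) < k:
        return even_segments(text, k)  # too few sentences to search over
    best = max(
        combinations(range(1, len(sentences)), k - 1),
        key=lambda cuts: sum(
            jaccard(r, s) for r, s in zip(reference, cut(sentences, cuts))
        ),
    )
    return cut(sentences, best)


def merge_for_prompt(answer_a: str, answer_b: str, k: int = 3) -> str:
    """Interleave aligned segments of both answers into one prompt body."""
    segs_a = even_segments(answer_a, k)
    segs_b = aligned_segments(segs_a, answer_b)
    parts = []
    for i, (seg_a, seg_b) in enumerate(zip(segs_a, segs_b), start=1):
        parts.append(f"[Assistant A, part {i}]\n{seg_a}")
        parts.append(f"[Assistant B, part {i}]\n{seg_b}")
    return "\n\n".join(parts)


def is_consistent(judge: Callable[[str, str], str], a: str, b: str) -> bool:
    """True if the same answer wins (or both tie) when the order is swapped."""
    swapped = {"first": "second", "second": "first", "tie": "tie"}
    return judge(a, b) == swapped[judge(b, a)]
```

Under these assumptions, a consistency rate is simply the fraction of answer pairs for which `is_consistent` holds; a judge that always answers `"first"` scores zero, which is the failure mode that position-bias calibration targets.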