Large language models (LLMs) have shown promise as automated evaluators for assessing the quality of answers generated by AI systems. However, these LLM-based evaluators exhibit position bias, or inconsistency, when used to evaluate candidate answers in pairwise comparisons, favoring either the first or the second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system designed to mimic human comparison strategies and calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits the answers into multiple segments, aligns similar content across candidate answers, and then merges them back into a single prompt for evaluation by LLMs. We conducted extensive experiments with six diverse LLMs to evaluate 11,520 answer pairs. Our results show that PORTIA markedly enhances the consistency rates for all models and comparison forms tested, achieving an average relative improvement of 47.46%. Remarkably, PORTIA enables less advanced GPT models to achieve 88% agreement with the state-of-the-art GPT-4 model at just 10% of the cost. Furthermore, it rectifies around 80% of the position bias instances within the GPT-4 model, raising its consistency rate to as high as 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass the standalone GPT-4 in terms of alignment with human evaluators. These findings highlight PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while remaining cost-efficient. This represents a valuable step toward a more reliable and scalable use of LLMs for automated evaluations across diverse applications.
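The split-align-merge procedure described above can be sketched minimally as follows. This is a hypothetical illustration, not the paper's implementation: all function names are invented, segments are split naively by sentence, and alignment uses a crude Jaccard word-overlap similarity as a stand-in for whatever semantic alignment PORTIA actually performs.

```python
def split_into_segments(answer: str, k: int = 3) -> list[str]:
    """Split an answer into up to k segments by sentence boundaries.
    (Naive splitting; a real system would segment more carefully.)"""
    sentences = [s.strip() for s in answer.split(". ") if s.strip()]
    size = max(1, len(sentences) // k)
    chunks = [". ".join(sentences[i:i + size])
              for i in range(0, len(sentences), size)]
    return chunks[:k]

def similarity(a: str, b: str) -> float:
    """Jaccard overlap of word sets -- a crude proxy for content similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(1, len(wa | wb))

def align_and_merge(answer1: str, answer2: str, k: int = 3) -> str:
    """Greedily pair each segment of answer1 with its most similar
    remaining segment of answer2, then interleave the pairs into one
    prompt so the evaluator compares like content against like content."""
    segs1 = split_into_segments(answer1, k)
    segs2 = split_into_segments(answer2, k)
    merged = []
    for i, s1 in enumerate(segs1):
        merged.append(f"[Assistant 1, part {i + 1}] {s1}")
        if not segs2:
            continue
        # pick the best remaining match from answer2
        j = max(range(len(segs2)), key=lambda idx: similarity(s1, segs2[idx]))
        merged.append(f"[Assistant 2, part {i + 1}] {segs2.pop(j)}")
    merged.extend(f"[Assistant 2, extra] {s}" for s in segs2)
    return "\n".join(merged)
```

Because matched content from both answers appears side by side in each part of the merged prompt, neither answer occupies a globally "first" or "second" position, which is the intuition behind calibrating position bias.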