Using Large Language Models (LLMs) as evaluators of other LLMs' outputs has recently garnered attention. However, this evaluation paradigm is susceptible to biases inherent in the judge LLM, raising concerns about the accuracy and reliability of its verdicts. To mitigate this issue, we propose and study two many-shot in-context learning (ICL) prompt templates that help LLM evaluators counteract these biases: \textbf{M}any-\textbf{S}hot \textbf{w}ith \textbf{R}eference (\textbf{MSwR}) and \textbf{M}any-\textbf{S}hot with\textbf{o}ut \textbf{R}eference (\textbf{MSoR}). Concretely, the former provides in-context examples accompanied by model-generated rationales as guidance, whereas the latter provides the examples alone. Building on these templates, we investigate how scaling the number of in-context examples affects the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Furthermore, we reveal a symbol bias hidden within the selection bias of LLMs and propose a simple yet effective approach to mitigate it. Experimental results further verify the effectiveness of this symbol bias mitigation approach.
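As a minimal sketch of how the two templates might be assembled, the following Python snippet builds a pairwise-judging prompt with or without rationales; the field names, example structure, and instruction wording are illustrative assumptions, not the paper's exact templates:

\begin{verbatim}
# Sketch of the two many-shot prompt templates (MSwR / MSoR).
# Field names and wording are illustrative assumptions.

def build_eval_prompt(examples, query, with_reference=True):
    """Assemble a many-shot pairwise-evaluation prompt.

    examples: list of dicts with keys 'instruction', 'response_a',
              'response_b', 'rationale' (model-generated), 'verdict'.
    query:    dict with keys 'instruction', 'response_a', 'response_b'.
    with_reference: True -> MSwR (rationales included), False -> MSoR.
    """
    parts = ["You are an impartial judge. Compare the two responses "
             "and answer with the better one as 'A' or 'B'.\n"]
    for i, ex in enumerate(examples, 1):
        parts.append(f"### Example {i}")
        parts.append(f"Instruction: {ex['instruction']}")
        parts.append(f"Response A: {ex['response_a']}")
        parts.append(f"Response B: {ex['response_b']}")
        if with_reference:  # MSwR: rationale serves as guidance
            parts.append(f"Rationale: {ex['rationale']}")
        parts.append(f"Verdict: {ex['verdict']}\n")
    parts.append("### Now evaluate")
    parts.append(f"Instruction: {query['instruction']}")
    parts.append(f"Response A: {query['response_a']}")
    parts.append(f"Response B: {query['response_b']}")
    parts.append("Verdict:")
    return "\n".join(parts)
\end{verbatim}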
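One common way to counter symbol bias, shown below as a hedged sketch that may or may not match the paper's exact method, is to query the judge under every assignment of symbols to responses and aggregate the votes, so that any fixed preference for a particular symbol cancels out; the \verb|judge| callable is a hypothetical stand-in for an LLM API call:

\begin{verbatim}
# Illustrative symbol-bias mitigation: permute symbol assignments
# and take a majority vote. Not necessarily the paper's method.
from collections import Counter
from itertools import permutations

def debiased_verdict(judge, instruction, responses, symbols=("A", "B")):
    """judge: stand-in callable mapping a prompt string to the chosen
    symbol. Returns the response that wins the majority vote across
    all symbol assignments (ties broken arbitrarily)."""
    votes = Counter()
    for order in permutations(range(len(responses))):
        mapping = {symbols[slot]: responses[idx]
                   for slot, idx in enumerate(order)}
        prompt = (f"Instruction: {instruction}\n"
                  + "".join(f"Response {s}: {r}\n"
                            for s, r in mapping.items())
                  + "Which response is better? Answer with the symbol only.")
        chosen = judge(prompt)
        if chosen in mapping:
            # Map the symbol back to the underlying response to vote.
            votes[mapping[chosen]] += 1
    return votes.most_common(1)[0][0]
\end{verbatim}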