LLM-as-a-Judge has been widely adopted in research and practical applications, yet the robustness and reliability of its evaluations remain a critical concern. A core challenge is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes; automated, systematic exploration of potentially unknown biases is still lacking. Such exploration is nevertheless crucial for improving the robustness and reliability of evaluation. To bridge this gap, we propose BiasScope, an LLM-driven framework for automatically discovering, at scale, potential biases that may arise during model evaluation. BiasScope uncovers potential biases across different model families and scales, and its generality and effectiveness are validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery from a passive process that relies on manual effort and predefined bias lists into an active, comprehensive, and automated exploration. Building on BiasScope, we further propose JudgeBench-Pro, an extension of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-Judge. Strikingly, even powerful LLMs used as evaluators exhibit error rates above 50\% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and to further mitigate potential biases.