As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems operate in autonomous, self-maintaining feedback loops. Any such autonomous system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems that enforce evaluation standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To address this, we propose average bias-boundedness (A-BB), an algorithmic framework that formally bounds the harm caused by any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (τ = 0.5, δ = 0.01) bias-bounded guarantees while retaining 61-99% correlation with the original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.