With advancements in reasoning capabilities, Large Language Models (LLMs) are increasingly employed for automated judgment tasks. While LLMs-as-Judges offer promise in automating evaluations, current approaches often rely on simplistic aggregation methods (e.g., majority voting), which can fail even when some individual agents provide correct answers. To address this, we propose a multi-agent debate judge framework in which agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. To enhance efficiency, we introduce a stability detection mechanism that models the judges' collective correct-rate dynamics with a time-varying mixture of Beta-Binomial distributions and applies an adaptive stopping criterion based on distributional similarity (the Kolmogorov-Smirnov statistic). Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.
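As a minimal sketch of the adaptive stopping idea, the snippet below compares the empirical distribution of judges' consensus counts between consecutive debate rounds with a two-sample Kolmogorov-Smirnov test and stops once the distribution stops shifting. The names `run_debate_round` and `epsilon` are hypothetical, and the raw consensus counts stand in for the paper's time-varying Beta-Binomial mixture; this is an illustration of the stopping criterion under those assumptions, not the exact implementation.

```python
# Sketch: adaptive stopping for multi-agent debate via a Kolmogorov-Smirnov
# statistic between consecutive rounds' consensus-count samples (illustrative).
import numpy as np
from scipy.stats import ks_2samp

def debate_with_adaptive_stopping(run_debate_round, max_rounds=10, epsilon=0.1):
    """Run debate rounds until the judges' consensus distribution stabilizes.

    run_debate_round(t) is assumed to return, for round t, an array with one
    entry per evaluated item: the number of judges whose verdict matches the
    current majority (a Beta-Binomial-like count in the paper's model).
    """
    prev_counts = None
    counts = None
    for t in range(max_rounds):
        counts = np.asarray(run_debate_round(t), dtype=float)
        if prev_counts is not None:
            # KS statistic between consecutive rounds' consensus counts; a small
            # value indicates the consensus distribution has stopped shifting.
            stat, _ = ks_2samp(prev_counts, counts)
            if stat < epsilon:
                break  # consensus is stable: stop early to save judge calls
        prev_counts = counts
    return t, counts

# Toy usage: simulate rounds whose agreement tightens and then plateaus.
rng = np.random.default_rng(0)
def simulated_round(t, n_items=200, n_judges=5):
    p = min(0.5 + 0.1 * t, 0.9)  # agreement probability grows, then saturates
    return rng.binomial(n_judges, p, size=n_items)

stopped_round, final_counts = debate_with_adaptive_stopping(simulated_round)
```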