Multi-Agent Debate for LLM Judges with Adaptive Stability Detection

With advancements in reasoning capabilities, Large Language Models (LLMs) are increasingly employed for automated judgment tasks. While LLMs-as-Judges show promise for automating evaluations, current approaches often rely on simplistic aggregation methods (e.g., majority voting), which can fail even when individual agents provide correct answers. To address this, we propose a multi-agent debate judge framework in which agents collaboratively reason and iteratively refine their responses. We formalize the debate process mathematically, analyzing agent interactions and proving that debate amplifies correctness compared to static ensembles. To enhance efficiency, we introduce a stability detection mechanism. This mechanism models the dynamics of the judges' collective correct rate using a time-varying mixture of Beta-Binomial distributions and employs an adaptive stopping criterion based on distributional similarity (Kolmogorov-Smirnov statistic). Experiments across multiple benchmarks and models demonstrate that our framework improves judgment accuracy over majority voting while maintaining computational efficiency.
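The stopping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the Beta-Binomial mixture parameters for each debate round are already available (the fitting step is omitted), and the function names, the `threshold` of 0.05, and the `patience` of 2 rounds are hypothetical choices made for illustration only.

```python
import math
from typing import List, Optional, Sequence, Tuple

# One mixture component: (weight, alpha, beta). Hypothetical representation.
Component = Tuple[float, float, float]


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_binomial_pmf(k: int, n: int, a: float, b: float) -> float:
    """P(K = k) when K out of n judges are correct under Beta-Binomial(n, a, b)."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + log_beta(k + a, n - k + b) - log_beta(a, b))


def mixture_pmf(n: int, components: Sequence[Component]) -> List[float]:
    """Pmf over k = 0..n for a weighted mixture of Beta-Binomial components."""
    return [
        sum(w * beta_binomial_pmf(k, n, a, b) for w, a, b in components)
        for k in range(n + 1)
    ]


def ks_statistic(p: Sequence[float], q: Sequence[float]) -> float:
    """Largest absolute gap between the CDFs of two pmfs on the same support."""
    cdf_p = cdf_q = worst = 0.0
    for pk, qk in zip(p, q):
        cdf_p += pk
        cdf_q += qk
        worst = max(worst, abs(cdf_p - cdf_q))
    return worst


def first_stable_round(
    rounds: Sequence[Sequence[Component]],
    n: int,
    threshold: float = 0.05,  # assumed tolerance, not taken from the paper
    patience: int = 2,        # assumed number of consecutive stable comparisons
) -> Optional[int]:
    """Return the 1-based round at which debate could stop, or None."""
    streak = 0
    prev: Optional[List[float]] = None
    for t, components in enumerate(rounds, start=1):
        cur = mixture_pmf(n, components)
        if prev is not None:
            stable = ks_statistic(prev, cur) < threshold
            streak = streak + 1 if stable else 0
            if streak >= patience:
                return t
        prev = cur
    return None
```

With five judges, a sequence of rounds whose fitted distribution changes once and then stays fixed would stop after two consecutive comparisons fall below the threshold; a debate that never settles returns `None` and runs to its round limit.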

