Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation

The evaluation of Large Language Models (LLMs) remains challenging due to inconsistency, bias, and the absence of transparent decision criteria in automated judging. We present Debate, Deliberate, Decide (D3), a cost-aware, adversarial multi-agent framework that orchestrates structured debate among role-specialized agents (advocates, a judge, and an optional jury) to produce reliable and interpretable evaluations. D3 instantiates two complementary protocols: (1) Multi-Advocate One-Round Evaluation (MORE), which elicits k parallel defenses per answer to amplify signal via diverse advocacy, and (2) Single-Advocate Multi-Round Evaluation (SAMRE) with budgeted stopping, which iteratively refines arguments under an explicit token budget and convergence checks. We develop a probabilistic model of score gaps that (i) characterizes reliability and convergence under iterative debate and (ii) explains the separation gains from parallel advocacy. Under mild assumptions, the posterior distribution of the round-r gap concentrates around the true difference and the probability of mis-ranking vanishes; moreover, aggregating across k advocates provably increases expected score separation. We complement theory with a rigorous experimental suite across MT-Bench, AlignBench, and AUTO-J, showing state-of-the-art agreement with human judgments (accuracy and Cohen's kappa), reduced positional and verbosity biases via anonymization and role diversification, and a favorable cost-accuracy frontier enabled by budgeted stopping. Ablations and qualitative analyses isolate the contributions of debate, aggregation, and anonymity. Together, these results establish D3 as a principled, practical recipe for reliable, interpretable, and cost-aware LLM evaluation.
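To make the two protocols concrete, here is a minimal sketch of how MORE-style aggregation and SAMRE-style budgeted stopping might look. The abstract does not specify the aggregation rule, the convergence test, or the budget accounting, so everything below is an assumption made for illustration: the function names `more_gap` and `samre`, the use of a simple mean over the k advocate scores, the tolerance `tol` on the change in the score gap, and the `judge_round` callback (a hypothetical stand-in for one round of debate plus judging that returns two scores and the tokens consumed).

```python
from statistics import mean
from typing import Callable, List, Optional, Tuple


def more_gap(scores_a: List[float], scores_b: List[float]) -> float:
    """MORE-style aggregation (assumed): average the k scores elicited by
    parallel advocates for each answer and return the gap A - B."""
    return mean(scores_a) - mean(scores_b)


def samre(
    judge_round: Callable[[int], Tuple[float, float, int]],
    token_budget: int,
    max_rounds: int = 5,
    tol: float = 0.5,
) -> Tuple[float, int, int]:
    """SAMRE-style loop (assumed): run debate rounds until the score gap
    stabilizes, the token budget is used up, or max_rounds is reached.

    judge_round(r) is a hypothetical callback returning
    (score_a, score_b, tokens_used) for round r.
    Returns (final_gap, rounds_run, tokens_spent).
    """
    spent = 0
    rounds = 0
    gap = 0.0
    prev_gap: Optional[float] = None
    for r in range(1, max_rounds + 1):
        score_a, score_b, tokens = judge_round(r)
        spent += tokens
        rounds = r
        gap = score_a - score_b
        # Convergence check: the gap changed by less than tol since last round.
        converged = prev_gap is not None and abs(gap - prev_gap) < tol
        # Budgeted stopping: do not start another round once the budget is spent.
        if converged or spent >= token_budget:
            break
        prev_gap = gap
    return gap, rounds, spent
```

In this sketch the sign of the returned gap decides the ranking (positive favors answer A), and a smaller `token_budget` trades accuracy for cost by cutting the debate short. A real implementation would replace `judge_round` with calls to the advocate and judge models.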
