Multi-agent deliberation systems using large language models (LLMs) are increasingly proposed for policy simulation, yet they suffer from artificial consensus: evaluator agents converge on the same option regardless of their assigned value perspectives. We present the AI Council, a three-phase deliberation framework, and conduct 120 deliberations across two policy scenarios to test two interventions. First, architectural heterogeneity (assigning a different 7-9B parameter model to each value perspective) significantly reduces first-choice concentration compared to a homogeneous baseline (child welfare: 70.9% to 46.1%, p < 0.001, r = 0.58; housing: 46.0% to 22.9%, p < 0.001, r = 0.50). This contrasts with accuracy-oriented multi-agent debate, where heterogeneity does not reduce convergence, suggesting model diversity operates differently when no objectively correct answer exists. Second, coherence validation (using a frontier model to assess whether each evaluator's reasoning is grounded in its assigned values) reveals a fidelity-diversity tradeoff: on a scenario with a dominant option, it further reduces concentration (46.1% to 40.8%, p = 0.004), but on a scenario with genuinely competitive options, it increases concentration (22.9% to 26.6%, p = 0.96) by amplifying high-coherence evaluators who cluster on one option. This tradeoff may be a general property of multi-agent systems employing quality weighting. We report negative results from three failed Delphi designs, demonstrate that 8B models exhibit binary rather than graded responses to counter-arguments, and propose the trustworthy tension rate as a diagnostic measure of small-model deliberation capabilities.
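The fidelity-diversity tradeoff described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that first-choice concentration is the share of (optionally weighted) first-choice support held by the most popular option, and that coherence validation acts as a per-evaluator weight; the abstract specifies neither detail. The function name, option labels, and coherence scores are hypothetical and chosen only to show the direction of the effect.

```python
from collections import defaultdict


def first_choice_concentration(choices, weights=None):
    """Share of total (weighted) first-choice support held by the top option.

    Assumed definition for illustration; the paper's exact metric may differ.
    """
    if weights is None:
        weights = [1.0] * len(choices)  # unweighted: one vote per evaluator
    totals = defaultdict(float)
    for option, weight in zip(choices, weights):
        totals[option] += weight
    return max(totals.values()) / sum(totals.values())


# Hypothetical scenario with a dominant option: most evaluators pick "A",
# but their reasoning is weakly grounded in their assigned values.
dominant_choices = ["A", "A", "A", "B", "C"]
dominant_coherence = [0.4, 0.5, 0.4, 0.9, 0.9]

# Hypothetical scenario with competitive options: choices are spread out,
# but the high-coherence evaluators happen to cluster on "A".
competitive_choices = ["A", "A", "B", "C", "D"]
competitive_coherence = [0.9, 0.9, 0.4, 0.5, 0.4]

for name, choices, coherence in [
    ("dominant", dominant_choices, dominant_coherence),
    ("competitive", competitive_choices, competitive_coherence),
]:
    plain = first_choice_concentration(choices)
    weighted = first_choice_concentration(choices, coherence)
    print(f"{name}: unweighted={plain:.3f}, coherence-weighted={weighted:.3f}")
```

With these made-up inputs, weighting lowers concentration in the dominant-option case and raises it in the competitive case, which is the qualitative pattern the abstract reports. The magnitudes carry no meaning.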