The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet existing debate experiments have relied on datasets with ground truth, where lying reduces to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires believing that the claim one defends is false. In this work, we apply debate to subjective questions and explicitly measure the prior beliefs of large language models before the experiments. Debaters were asked to select their preferred position and were then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge's presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs than when arguing against them. Our main findings show that (i) models tend to defend stances aligned with the judge persona rather than with their prior beliefs; (ii) the sequential protocol introduces a significant bias favoring the second debater; (iii) models are more persuasive when defending positions aligned with their prior beliefs; and (iv) paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparisons. These results can help human judges provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.
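The contrast between the two protocols reduces to a scheduling difference: under the usual reading of these terms, the sequential protocol lets the second debater condition on the first debater's argument from the same round, whereas the simultaneous protocol has both debaters argue from the same transcript snapshot. The sketch below is a minimal, hypothetical illustration of that difference, not the paper's implementation; `query_model`, `sequential_debate`, and `simultaneous_debate` are assumed names, and the LLM call is stubbed out.

```python
# Minimal sketch (assumed names, not the paper's code) contrasting the
# sequential and simultaneous debate protocols.

from typing import List, Tuple

Turn = Tuple[str, str]  # (debater label, argument text)


def query_model(stance: str, transcript: List[Turn]) -> str:
    """Placeholder for an LLM call; swap in any chat-completion client.
    Returns a canned argument so the sketch runs end to end."""
    return f"Argument for '{stance}' given {len(transcript)} prior turns."


def sequential_debate(stance_a: str, stance_b: str, rounds: int = 2) -> List[Turn]:
    """Debater B sees A's argument from the same round before replying,
    which is the asymmetry that can favor the second debater."""
    transcript: List[Turn] = []
    for _ in range(rounds):
        transcript.append(("A", query_model(stance_a, transcript)))
        transcript.append(("B", query_model(stance_b, transcript)))
    return transcript


def simultaneous_debate(stance_a: str, stance_b: str, rounds: int = 2) -> List[Turn]:
    """Both debaters argue from the same transcript snapshot each round,
    so neither gets a within-round informational advantage."""
    transcript: List[Turn] = []
    for _ in range(rounds):
        snapshot = list(transcript)  # frozen view before either debater speaks
        turn_a = ("A", query_model(stance_a, snapshot))
        turn_b = ("B", query_model(stance_b, snapshot))
        transcript.extend([turn_a, turn_b])
    return transcript
```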