As AI systems are used to answer more difficult questions and potentially help create new knowledge, judging the truthfulness of their outputs becomes more difficult and more important. How can we supervise unreliable experts, which have access to the truth but may not accurately report it, to give answers that are systematically true and don't just superficially seem true, when the supervisor can't tell the difference between the two on their own? In this work, we show that debate between two unreliable experts can help a non-expert judge more reliably identify the truth. We collect a dataset of human-written debates on hard reading comprehension questions where the judge has not read the source passage, only ever seeing expert arguments and short quotes selectively revealed by 'expert' debaters who have access to the passage. In our debates, one expert argues for the correct answer, and the other for an incorrect answer. Comparing debate to a baseline we call consultancy, where a single expert argues for only one answer which is correct half of the time, we find that debate performs significantly better, with 84% judge accuracy compared to consultancy's 74%. Debates are also more efficient, being 68% of the length of consultancies. By comparing human to AI debaters, we find evidence that with more skilled (in this case, human) debaters, the performance of debate goes up but the performance of consultancy goes down. Our error analysis also supports this trend, with 46% of errors in human debate attributable to mistakes by the honest debater (which should go away with increased skill); whereas 52% of errors in human consultancy are due to debaters obfuscating the relevant evidence from the judge (which should become worse with increased skill). Overall, these results show that debate is a promising approach for supervising increasingly capable but potentially unreliable AI systems.