We test the robustness of debate as a method of scalable oversight by training models to debate with data generated via self-play. In a long-context reading comprehension task, we find that language-model-based evaluators answer questions more accurately when judging models optimized to win debates. By contrast, we find no such relationship for consultancy models trained to persuade a judge without an opposing debater present. In quantitative and qualitative comparisons between our debate models and novel consultancy baselines, we find evidence that debate training encourages stronger and more informative arguments, showing promise that it can help provide high-quality supervision for tasks that are difficult to evaluate directly.