While Chain-of-Thought (CoT) monitoring offers a unique opportunity for AI safety, this opportunity could be lost through shifts in training practices or model architecture. To help preserve monitorability, we propose a pragmatic way to measure two of its components: legibility (whether the reasoning can be followed by a human) and coverage (whether the CoT contains all the reasoning needed for a human to also produce the final output). We implement these metrics with an autorater prompt that enables any capable LLM to compute the legibility and coverage of existing CoTs. After sanity-checking our prompted autorater with synthetic CoT degradations, we apply it to several frontier models on challenging benchmarks and find that they exhibit high monitorability. We present these metrics, including our complete autorater prompt, as a tool for developers to track how design decisions impact monitorability. While the exact prompt we share is still a preliminary version under ongoing development, we are sharing it now in the hope that others in the community will find it useful. Our method measures the default monitorability of CoT; it should be seen as a complement to, not a replacement for, the adversarial stress-testing needed to assess robustness against deliberately evasive models.
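As a rough illustration of the autorater pattern described above (not the authors' actual prompt, which is released separately), the workflow can be sketched as: format the question, CoT, and final answer into a rating prompt, send it to any capable LLM, and parse legibility and coverage scores from the reply. The template wording and the `call_llm` stub below are assumptions standing in for a real prompt and model API.

```python
import json

# Hypothetical rating template; the paper's actual autorater prompt differs.
AUTORATER_TEMPLATE = """You are rating a model's chain of thought.
Question: {question}
Chain of thought: {cot}
Final answer: {answer}

Score two properties from 0 to 1 and reply as a JSON object:
- "legibility": could a human follow this reasoning?
- "coverage": does the CoT contain all the reasoning a human would
  need to produce the final answer themselves?
Reply with only the JSON object."""

def call_llm(prompt: str) -> str:
    # Stub standing in for any capable LLM API; returns a canned reply here.
    return '{"legibility": 0.9, "coverage": 0.8}'

def rate_cot(question: str, cot: str, answer: str) -> dict:
    """Compute legibility and coverage for one CoT via the autorater."""
    prompt = AUTORATER_TEMPLATE.format(question=question, cot=cot, answer=answer)
    scores = json.loads(call_llm(prompt))
    return {k: float(scores[k]) for k in ("legibility", "coverage")}

scores = rate_cot("What is 12*13?",
                  "12*13 = 12*10 + 12*3 = 120 + 36 = 156.",
                  "156")
print(scores)  # {'legibility': 0.9, 'coverage': 0.8}
```

In practice the canned reply would be replaced by a real model call, and scores would be averaged over many (question, CoT, answer) triples per model and benchmark.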