Chain-of-Thought (CoT) monitoring, in which automated systems monitor the CoT of an LLM, is a promising approach for overseeing AI systems. However, the extent to which a model's CoT helps us oversee the model (the monitorability of the CoT) can be affected by training, for instance when the model learns to hide important features of its reasoning. We propose and empirically validate a conceptual framework for predicting when and why this occurs. We model LLM post-training as an RL environment in which the reward decomposes into two terms: one that depends on the final output and one that depends on the CoT. Our framework lets us classify the relationship between these two terms as "aligned", "orthogonal", or "in-conflict" before training. We predict that training with in-conflict terms will reduce monitorability, that orthogonal terms will not affect it, and that aligned terms will improve it. To validate the framework, we use it to classify a set of RL environments, train LLMs within those environments, and evaluate how training affects CoT monitorability. We find that (1) training with in-conflict reward terms reduces CoT monitorability and (2) optimizing in-conflict reward terms is difficult.
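The reward decomposition described above can be illustrated with a minimal sketch. The abstract says only that the reward splits into a term on the final output and a term on the CoT. The additive combination, the `Rollout` type, and the two example terms (`answer_is_42`, `short_cot_bonus`) are hypothetical choices made here for illustration, not the authors' environments, and the sketch does not attempt to classify the pair as aligned, orthogonal, or in-conflict.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rollout:
    cot: str     # the model's chain of thought
    output: str  # the model's final output


def make_reward(
    output_term: Callable[[str], float],
    cot_term: Callable[[str], float],
) -> Callable[[Rollout], float]:
    """Build a reward from an output-only term and a CoT-only term.

    Summing the two terms is an assumption of this sketch.
    """
    def reward(rollout: Rollout) -> float:
        return output_term(rollout.output) + cot_term(rollout.cot)
    return reward


# Hypothetical output term: 1.0 if the final answer is exactly "42".
def answer_is_42(output: str) -> float:
    return 1.0 if output.strip() == "42" else 0.0


# Hypothetical CoT term: a small penalty per whitespace-separated token.
def short_cot_bonus(cot: str) -> float:
    return -0.01 * len(cot.split())


reward = make_reward(answer_is_42, short_cot_bonus)
r = reward(Rollout(cot="6 times 7 is 42", output="42"))
```

Keeping the two terms as separate functions with disjoint inputs makes it possible to inspect each one on its own before any training run, which matches the abstract's emphasis on classifying the terms in advance.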