As language models continue to rapidly improve, we can expect their actions and reasoning to become difficult or impossible for weaker agents and humans to follow, undermining interpretability and oversight. With an eye on long-term futures, we pursue methods that encourage models to produce solutions that remain intelligible to weaker collaborators. We formalize intelligibility as handoff robustness: a strong model's solution is intelligible to a weaker model if randomly handing off control to the weaker model along the solution path does not cause failure. Building on this criterion, we introduce tandem training for language models, a reinforcement learning (RL) paradigm in which rollout tokens are intermittently and randomly sampled from a frozen weak model rather than the strong model being trained. Because rollouts succeed only when the strong model's actions and reasoning process can be continued by the weak model -- when the two can co-construct a successful solution -- optimizing standard RL objectives with tandem training implicitly incentivizes both correctness and intelligibility. In the GSM8K math reasoning task, tandem training reliably teaches models to abandon jargon and adapt their language to weaker partners while keeping task accuracy high. Our results demonstrate a promising route to building AI systems that remain auditable by weaker agents, with implications for human--AI collaboration and multi-agent communication.
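The rollout mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `strong_step` and `weak_step` are hypothetical callables standing in for next-token sampling from the trainable strong model and the frozen weak model, and `handoff_prob` is an assumed per-token handoff rate.

```python
import random


def tandem_rollout(strong_step, weak_step, prompt, handoff_prob=0.2,
                   max_tokens=50, seed=0, eos="<eos>"):
    """Generate one rollout in which each new token is sampled from the
    frozen weak model with probability `handoff_prob`, and from the
    strong model otherwise. Returns the token sequence plus a record of
    which model produced each generated token."""
    rng = random.Random(seed)
    tokens = list(prompt)
    sources = []  # "weak" or "strong" for each generated token
    for _ in range(max_tokens):
        use_weak = rng.random() < handoff_prob
        step = weak_step if use_weak else strong_step
        tok = step(tokens)  # model proposes the next token given context
        tokens.append(tok)
        sources.append("weak" if use_weak else "strong")
        if tok == eos:
            break
    return tokens, sources


# Toy stand-ins: the strong model always emits "A", the weak model "b".
strong_step = lambda ctx: "A"
weak_step = lambda ctx: "b"
tokens, sources = tandem_rollout(strong_step, weak_step, ["q:"],
                                 handoff_prob=0.2, max_tokens=50, seed=0)
```

In an actual RL setup, the reward for the completed rollout (e.g., task success on GSM8K) would then be used to update only the strong model's parameters, so that optimizing the standard objective also rewards trajectories the weak partner can continue.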