Large reasoning models (LRMs) produce a textual chain of thought (CoT) while solving a problem, which can serve as a potentially powerful tool for understanding that problem by surfacing a human-readable, natural-language explanation. However, it is unclear whether these explanations generalize, i.e., whether they capture general patterns about the underlying problem rather than patterns idiosyncratic to the LRM that produced them. This question is crucial when explanations are used to understand or discover new concepts, e.g., in AI for science. We study this generalization question by evaluating a specific notion of generalizability: whether explanations produced by one LRM induce the same behavior when given to other LRMs. We find that CoT explanations often exhibit this form of generalization (i.e., they increase consistency between LRMs) and that the degree of generalization correlates with human preference rankings and with post-training via reinforcement learning. We further analyze the conditions under which explanations yield consistent answers and propose a straightforward, sentence-level ensembling strategy that improves consistency. Taken together, these results prescribe caution when using LRM explanations to derive new insights and outline a framework for characterizing the generalization of LRM explanations.
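The evaluation the abstract describes can be sketched in two pieces: a consistency rate (how often a target LRM, conditioned on a source LRM's explanation, reproduces the source LRM's answer) and a sentence-level majority-vote ensemble over answers obtained from individual explanation sentences. This is a minimal illustrative sketch under assumed interfaces; the function names and the majority-vote scheme are our assumptions, not the paper's exact procedure.

```python
from collections import Counter


def consistency_rate(source_answers, target_answers):
    """Fraction of problems on which the target LRM, given the source
    LRM's explanation, produces the same answer as the source LRM."""
    assert len(source_answers) == len(target_answers) > 0
    agree = sum(a == b for a, b in zip(source_answers, target_answers))
    return agree / len(source_answers)


def sentence_ensemble(answers_per_sentence):
    """Majority vote over the answers a target LRM gives when
    conditioned on each sentence of the explanation separately."""
    return Counter(answers_per_sentence).most_common(1)[0][0]


# Toy illustration with pre-collected answers (no model calls):
rate = consistency_rate(["A", "B", "C", "D"], ["A", "B", "C", "A"])  # 0.75
vote = sentence_ensemble(["A", "A", "B"])  # "A"
```

In practice the answer lists would come from querying each LRM with the problem plus the (full or per-sentence) explanation; here they are hard-coded to keep the sketch self-contained.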