Large language models exhibit high-level commonsense reasoning abilities, especially when paired with enhancement methods such as Chain-of-Thought (CoT). However, we find that these CoT-like methods cause a considerable number of originally correct answers to turn wrong, which we define as the Toxic CoT problem. To interpret and mitigate this problem, we first use attribution tracing and causal tracing methods to probe the internal working mechanism of the LLM during CoT reasoning. Through comparisons, we show that the model loses information from the question in the shallow attention layers when generating rationales or answers. Based on these probing findings, we design a novel method called RIDERS (Residual decodIng and sERial-position Swap), which compensates for the information deficit in the model from both decoding and serial-position perspectives. Through extensive experiments on multiple commonsense reasoning benchmarks, we validate that this method not only significantly reduces Toxic CoT problems (decreased by 23.6%), but also effectively improves the model's overall commonsense reasoning performance (increased by 5.5%).
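To make the definition of a Toxic CoT case concrete, the following minimal Python sketch counts questions that are answered correctly without CoT but incorrectly with it. The function name, argument names, and returned fields are hypothetical and chosen for illustration only. The abstract does not state which denominator underlies the reported 23.6% decrease, so the sketch returns raw counts rather than committing to a single rate.

```python
from typing import Dict, Sequence


def toxic_cot_counts(
    gold: Sequence[str],
    direct: Sequence[str],
    cot: Sequence[str],
) -> Dict[str, int]:
    """Count per-question outcomes for direct answering vs. CoT answering.

    gold:   reference answers
    direct: model answers without CoT
    cot:    model answers with CoT
    All three names are illustrative assumptions, not taken from the paper.
    """
    if not (len(gold) == len(direct) == len(cot)):
        raise ValueError("gold, direct, and cot must have the same length")

    counts = {"n": len(gold), "direct_correct": 0, "cot_wrong": 0,
              "toxic": 0, "rescued": 0}
    for g, d, c in zip(gold, direct, cot):
        direct_ok = d == g
        cot_ok = c == g
        counts["direct_correct"] += direct_ok
        counts["cot_wrong"] += not cot_ok
        # Toxic CoT case: originally correct answer turns wrong under CoT.
        counts["toxic"] += direct_ok and not cot_ok
        # Opposite case: originally wrong answer is fixed by CoT.
        counts["rescued"] += (not direct_ok) and cot_ok
    return counts


# Toy data, made up for illustration.
gold = ["A", "B", "C", "D"]
direct = ["A", "B", "X", "D"]
cot = ["A", "X", "C", "X"]
result = toxic_cot_counts(gold, direct, cot)
```

Depending on the question being asked, `toxic` can be normalized by `direct_correct` (how often CoT breaks a correct answer) or by `cot_wrong` (what share of CoT errors are toxic); which of these matches the reported figure is an assumption that the abstract alone does not resolve.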