Large language models exhibit strong commonsense reasoning abilities, especially when enhanced with methods such as Chain-of-Thought (CoT). However, we find that these CoT-like methods cause a considerable number of originally correct answers to turn wrong, which we define as the Toxic CoT problem. To interpret and mitigate this problem, we first use attribution tracing and causal tracing methods to probe the internal working mechanism of the LLM during CoT reasoning. Through comparisons, we show that the model loses information from the question in its shallow attention layers when generating rationales or answers. Based on these probing findings, we design a novel method called RIDERS (Residual decodIng and sERial-position Swap), which compensates for the information deficit in the model from both the decoding and the serial-position perspectives. Through extensive experiments on multiple commonsense reasoning benchmarks, we validate that this method not only significantly reduces Toxic CoT problems (by 23.6%), but also effectively improves the model's overall commonsense reasoning performance (by 5.5%).
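To make the definition of Toxic CoT concrete, the sketch below counts questions that are answered correctly without CoT but incorrectly with CoT. It is only an illustration: the function name `toxic_cot_rate`, the answer-list inputs, and the choice to normalize by the total number of questions are assumptions made here, and the paper may measure the rate differently.

```python
from typing import Sequence


def toxic_cot_rate(
    gold: Sequence[str],
    direct: Sequence[str],
    cot: Sequence[str],
) -> float:
    """Share of questions that were correct with direct answering but wrong with CoT.

    Assumption: the rate is normalized by the total number of questions.
    This helper is hypothetical and not taken from the paper.
    """
    if not (len(gold) == len(direct) == len(cot)):
        raise ValueError("gold, direct, and cot must have the same length")
    if not gold:
        return 0.0
    # A case is "toxic" when the direct answer matches the gold label
    # and the CoT answer does not.
    toxic = sum(1 for g, d, c in zip(gold, direct, cot) if d == g and c != g)
    return toxic / len(gold)


# Toy data, invented purely for illustration.
gold_answers = ["A", "B", "C", "D"]
direct_answers = ["A", "B", "A", "D"]
cot_answers = ["A", "C", "C", "B"]
rate = toxic_cot_rate(gold_answers, direct_answers, cot_answers)
```

In this toy example the second and fourth questions flip from correct to wrong under CoT, while the third flips from wrong to correct, so only the first two kinds of flips count toward the rate.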