Is Moral Self-correction An Innate Capability of Large Language Models? A Mechanistic Analysis to Self-correction

Though intensive attentions to the self-correction capability of Large Language Models (LLMs), the underlying mechanism of this capability is still under-explored. In this paper, we aim to answer two fundamental questions for moral self-correction: (1) how different components in self-correction, such as Chain-of-Thought (CoT) reasoning, external feedback, and instructional prompts, interact to enable moral self-correction; and (2) is the self-correction one of LLMs' innate capabilities? To answer the first question, we examine how different self-correction components interact to intervene the embedded morality within hidden states, therefore contributing to different performance. For the second question, we (i) evaluate the robustness of moral self-correction by introducing natural language interventions of weak evidence into prompts; (ii) propose a validation framework, self-distinguish, that requires effective self-correction to enable LLMs to distinguish between desirable and undesirable outputs. Our experimental results indicate that there is no universally optimal self-correction method for the tasks considered, although external feedback and CoT can contribute to additional performance gains. However, our mechanistic analysis reveals negative interactions among instructional prompts, CoT, and external feedback, suggesting a conflict between internal knowledge and external feedback. The self-distinguish experiments demonstrate that while LLMs can self-correct their responses, they are unable to reliably distinguish between desired and undesired outputs. With our empirical evidence, we can conclude that moral self-correction is not an innate capability of LLMs acquired during pretraining.

翻译：尽管大语言模型（LLMs）的自我纠正能力受到广泛关注，但其内在机制仍未得到充分探索。本文旨在回答关于道德自我纠正的两个基本问题：（1）自我纠正中的不同组成部分，如思维链（CoT）推理、外部反馈和指令提示，如何相互作用以实现道德自我纠正；（2）自我纠正是否是LLMs的内在能力之一？针对第一个问题，我们研究了不同自我纠正组件如何相互作用以干预隐藏状态中嵌入的道德性，从而影响性能表现。对于第二个问题，我们（i）通过在提示中引入弱证据的自然语言干预来评估道德自我纠正的鲁棒性；（ii）提出一个验证框架——自我区分，要求有效的自我纠正能使LLMs区分期望和不期望的输出。实验结果表明，对于所考虑的任务，不存在普遍最优的自我纠正方法，尽管外部反馈和CoT可以带来额外的性能提升。然而，我们的机制分析揭示了指令提示、CoT和外部反馈之间存在负向交互，表明内部知识与外部反馈之间存在冲突。自我区分实验表明，虽然LLMs能够自我纠正其响应，但它们无法可靠地区分期望和不期望的输出。基于我们的实证证据，可以得出结论：道德自我纠正并非LLMs在预训练期间获得的内在能力。