Large Language Models (LLMs) can improve their responses when instructed to do so, a capability known as self-correction. When these instructions give no specific details about the issues in the response, the process is referred to as leveraging the intrinsic self-correction capability. Self-correction has shown empirical success in various applications, e.g., text detoxification and social bias mitigation. However, it is not always effective, as it can revise an initially correct response into an incorrect one. In this paper, we seek to understand how and why leveraging the self-correction capability is effective. We identify that appropriate instructions can guide LLMs to a convergence state, wherein additional self-correction steps do not yield further performance improvements. We empirically demonstrate that model uncertainty and activated latent concepts jointly characterize the effectiveness of self-correction. Furthermore, we provide a mathematical formulation indicating that the activated latent concept drives the convergence of both model uncertainty and self-correction performance. Our analysis also generalizes to the self-correction behaviors observed in Vision-Language Models (VLMs). Moreover, we highlight that task-agnostic debiasing can benefit from our principle when selecting effective fine-tuning samples. This initial success suggests that the principle could be extended to improve instruction tuning and safety alignment.