We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct", that is, to avoid producing harmful outputs, if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals different facets of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this level of scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm like stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.