We test the hypothesis that language models trained with reinforcement learning from human feedback (RLHF) have the capability to "morally self-correct" -- to avoid producing harmful outputs -- if instructed to do so. We find strong evidence in support of this hypothesis across three different experiments, each of which reveals a different facet of moral self-correction. We find that the capability for moral self-correction emerges at 22B model parameters, and typically improves with increasing model size and RLHF training. We believe that at this scale, language models obtain two capabilities that they can use for moral self-correction: (1) they can follow instructions and (2) they can learn complex normative concepts of harm such as stereotyping, bias, and discrimination. As such, they can follow instructions to avoid certain kinds of morally harmful outputs. We believe our results are cause for cautious optimism regarding the ability to train language models to abide by ethical principles.