We propose Reinforcement Learning from Contrast Distillation (RLCD), a method for aligning language models to follow natural language principles without using human feedback. RLCD trains a preference model on simulated preference pairs, each containing a high-quality and a low-quality example generated from contrasting positive and negative prompts. The preference model is then used to improve a base unaligned language model via reinforcement learning. Empirically, RLCD outperforms RLAIF (Bai et al., 2022b) and context distillation (Huang et al., 2022) baselines across three diverse alignment tasks (harmlessness, helpfulness, and story outline generation) and at both the 7B and 30B model scales used for preference data simulation.
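The pair-simulation step can be illustrated with a minimal sketch. It assumes that the output produced under the positive prompt is labeled as preferred by construction, with no separate scoring step. The names `PreferencePair`, `simulate_pair`, and `generate`, and the way the hints are appended to the prompt, are hypothetical choices made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PreferencePair:
    prompt: str    # the neutral prompt, without any positive/negative hint
    chosen: str    # output generated under the positive prompt
    rejected: str  # output generated under the negative prompt


def simulate_pair(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical wrapper around a base LM
    positive_hint: str,
    negative_hint: str,
) -> PreferencePair:
    """Build one simulated preference pair from two contrasting prompts.

    Assumption: the label comes from the prompt that produced each output,
    so the positive-prompt output is always stored as `chosen`.
    """
    chosen = generate(f"{prompt} {positive_hint}")
    rejected = generate(f"{prompt} {negative_hint}")
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


def simulate_dataset(
    prompts: list[str],
    generate: Callable[[str], str],
    positive_hint: str,
    negative_hint: str,
) -> list[PreferencePair]:
    """Apply simulate_pair to every prompt, preserving input order."""
    return [
        simulate_pair(p, generate, positive_hint, negative_hint) for p in prompts
    ]
```

The resulting pairs would then serve as training data for the preference model. Storing the neutral prompt rather than the hinted ones means the preference model never sees the hints that produced the contrast.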