Reinforcement Learning from Human Feedback (RLHF) is a vital strategy for enhancing the safety of language models. However, annotating preference data for RLHF is resource-intensive and demands creativity, while automatic generation methods are limited in data diversity and quality. In response, we present Safer-Instruct, a novel pipeline for semi-automatically constructing large-scale preference datasets. Our approach leverages reversed instruction tuning, instruction induction, and expert model evaluation to efficiently generate high-quality preference data without human annotators. We evaluate Safer-Instruct using LLaMA for instruction induction and GPT-4 as the expert model, generating approximately 10K preference samples. Fine-tuning an Alpaca model on this dataset improves harmlessness while maintaining competitive performance on conversation and downstream tasks. Safer-Instruct addresses the challenges in preference data acquisition, advancing the development of safer and more responsible AI systems. Our code and data are available at https://github.com/uscnlp-lime/safer-instruct
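The three stages named in the abstract (reversed instruction tuning, instruction induction, and expert model evaluation) can be pictured as a simple data flow. The sketch below is one possible reading of that flow, not the released implementation. All names in it (`PreferenceSample`, `build_preference_data`, `induce_instruction`, `is_acceptable`, `write_preferred`) are hypothetical. It assumes that a reverse-tuned model maps a raw text to an instruction that could have elicited it, that an expert model filters the induced instruction and writes the preferred response, and that the raw text serves as the dispreferred response. The abstract does not state that last pairing, so treat it as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class PreferenceSample:
    """One preference record: an instruction plus a ranked response pair."""
    instruction: str
    preferred: str
    dispreferred: str


def build_preference_data(
    raw_texts: Iterable[str],
    induce_instruction: Callable[[str], str],
    is_acceptable: Callable[[str, str], bool],
    write_preferred: Callable[[str], str],
) -> List[PreferenceSample]:
    """Hypothetical pipeline skeleton; every callable is a stand-in for a model.

    induce_instruction: reverse-tuned model, response text -> instruction.
    is_acceptable:      expert-model check on the (instruction, text) pair.
    write_preferred:    expert model, instruction -> preferred response.
    """
    samples: List[PreferenceSample] = []
    for text in raw_texts:
        instruction = induce_instruction(text)
        # Drop pairs the expert model rejects (e.g. instruction does not fit the text).
        if not is_acceptable(instruction, text):
            continue
        samples.append(
            PreferenceSample(
                instruction=instruction,
                preferred=write_preferred(instruction),
                dispreferred=text,  # assumption: the raw text is the rejected response
            )
        )
    return samples
```

In practice each callable would wrap a model call (for example, LLaMA for induction and GPT-4 as the expert model, as in the evaluation described above). Passing them in as plain functions keeps the skeleton testable with toy stand-ins.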