Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents, both because human supervision is costly to obtain and because it raises related issues of quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of how the principles are applied) to produce helpful, ethical, and reliable responses to users' queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly, without needing the principle set or the demonstrations; and finally, we offer a refinement step to address the issues of overly brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.
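
The four stages described above can be pictured as a small data pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable, all function names, the prompt wording, and the `min_words` length heuristic in the refinement stage are hypothetical and chosen only for illustration. Fine-tuning itself is not shown; stage 3 is represented only by the construction of the training pairs.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt string to a completion.
LLM = Callable[[str], str]


def generate_prompts(llm: LLM, seed_prompts: List[str], topics: List[str]) -> List[str]:
    """Stage 1: synthesize one new prompt per topic to increase diversity."""
    seeds = "\n".join(f"- {p}" for p in seed_prompts)
    return [
        llm(f"Example instructions:\n{seeds}\nWrite one new instruction about: {topic}")
        for topic in topics
    ]


def principled_response(
    llm: LLM,
    principles: List[str],
    exemplars: List[Tuple[str, str]],
    query: str,
) -> str:
    """Stage 2: answer a query with the principles and exemplars in context."""
    rules = "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))
    demos = "\n\n".join(f"User: {q}\nAssistant: {a}" for q, a in exemplars)
    return llm(f"Principles:\n{rules}\n\n{demos}\n\nUser: {query}\nAssistant:")


def build_sft_pairs(
    llm: LLM,
    principles: List[str],
    exemplars: List[Tuple[str, str]],
    queries: List[str],
) -> List[Tuple[str, str]]:
    """Stage 3 (data only): pair each bare query with its guided response.

    The principles and exemplars are left out of the stored pairs, so a model
    fine-tuned on them would learn to answer without that context.
    """
    return [(q, principled_response(llm, principles, exemplars, q)) for q in queries]


def refine_pairs(
    llm: LLM, pairs: List[Tuple[str, str]], min_words: int = 20
) -> List[Tuple[str, str]]:
    """Stage 4 (assumed heuristic): re-ask for detail when a response is very short."""
    refined = []
    for query, response in pairs:
        if len(response.split()) < min_words:
            response = llm(f"Give a detailed, direct answer.\nUser: {query}\nAssistant:")
        refined.append((query, response))
    return refined
```

Passing the model as a plain callable keeps the sketch independent of any particular inference library; a real setup would wrap whatever API serves the base model and would feed the refined pairs to a separate fine-tuning step.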
