Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision

Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents, due to the high cost of obtaining human supervision and the related issues of quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of applying the principles) to produce helpful, ethical, and reliable responses to users' queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly, without needing the principle set or the demonstrations; and finally, we offer a refinement step to address the issues of overly brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including fewer than 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.
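
The first two stages described above (topic-guided prompt synthesis, then principle-driven response generation) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `LLM` callable, the function name `build_self_align_data`, and the prompt templates are all hypothetical, chosen only to show how the data flows. Stages three and four (fine-tuning and refinement) require model training and appear only as comments.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: any function mapping prompt text to completion text.
LLM = Callable[[str], str]


def build_self_align_data(
    base_llm: LLM,
    seed_prompts: List[str],
    topics: List[str],
    principles: List[str],
    exemplars: List[str],
) -> List[Tuple[str, str]]:
    """Return (prompt, response) pairs for later fine-tuning (illustrative only)."""
    # Stage 1: synthesize new prompts from the seeds, one per topic,
    # using the topic as guidance to increase diversity.
    seed_block = "\n".join(seed_prompts)
    prompts = [
        base_llm(f"Example instructions:\n{seed_block}\n"
                 f"Write a new instruction about: {topic}")
        for topic in topics
    ]

    # Stage 2: answer each prompt with the principles and the
    # in-context demonstrations placed in front of the query.
    context = "\n".join(principles) + "\n\n" + "\n\n".join(exemplars)
    pairs = [
        (prompt, base_llm(f"{context}\n\nUser: {prompt}\nAssistant:"))
        for prompt in prompts
    ]

    # Stage 3 (not shown): fine-tune the base model on `pairs`. The stored
    # prompts omit the principles and exemplars, so the tuned model learns
    # to respond without them.
    # Stage 4 (not shown): a refinement step for overly brief or indirect responses.
    return pairs
```

Any text-completion function can be passed as `base_llm`, including a stub for testing. The one design point the sketch is meant to highlight is that the principles and exemplars appear only in the generation context of stage two, not in the stored pairs.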
