Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility. However, existing methods, which are primarily based on preference datasets, face challenges such as noisy labels, high annotation costs, and privacy concerns. In this work, we introduce Alignment from Demonstrations (AfD), a novel approach that leverages high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Drawing insights from forward and inverse reinforcement learning, we derive divergence minimization objectives for AfD. Analytically, we elucidate the mass-covering and mode-seeking behaviors of various approaches, explaining when and why certain methods are superior. Practically, we propose a computationally efficient algorithm that extrapolates over a reward model tailored for AfD. We validate our key insights through experiments on the Harmless and Helpful tasks, demonstrating strong empirical performance while maintaining simplicity.
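To make the mass-covering versus mode-seeking distinction concrete, the following is a minimal sketch in illustrative notation (not the paper's own symbols): writing $p_{\mathrm{demo}}$ for the demonstration distribution and $\pi_{\theta}$ for the policy being aligned, the two KL directions yield qualitatively different objectives.

```latex
% Forward KL: mass-covering. The expectation is taken under the
% demonstration distribution, so the objective reduces (up to a
% constant) to maximum likelihood on demonstrations, i.e., standard
% supervised fine-tuning.
\min_{\theta}\; \mathrm{KL}\!\left(p_{\mathrm{demo}} \,\middle\|\, \pi_{\theta}\right)
  \;=\; \min_{\theta}\; -\,\mathbb{E}_{x \sim p_{\mathrm{demo}}}\!\left[\log \pi_{\theta}(x)\right] \;+\; \mathrm{const}.

% Reverse KL: mode-seeking. The expectation is taken under the policy
% itself, so the policy is penalized for placing mass where the
% demonstrations have little, pushing it toward a few high-density modes.
\min_{\theta}\; \mathrm{KL}\!\left(\pi_{\theta} \,\middle\|\, p_{\mathrm{demo}}\right)
  \;=\; \min_{\theta}\; \mathbb{E}_{x \sim \pi_{\theta}}\!\left[\log \pi_{\theta}(x) - \log p_{\mathrm{demo}}(x)\right].
```

Note that the forward direction is directly optimizable from demonstration samples, whereas the reverse direction requires an estimate of $\log p_{\mathrm{demo}}$, which the policy cannot sample; this is, roughly, where reward modeling in the inverse-RL style becomes necessary.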