Large Language Models (LLMs) need to be aligned with human expectations to ensure their safety and utility in most applications. Alignment is challenging, costly, and needs to be repeated for every LLM and alignment criterion. We propose to decouple LLMs and alignment by training aligner models that can be used to align any LLM for a given criterion on an as-needed basis, which also reduces the potential negative impact of alignment on performance. Our recipe for training the aligner models relies solely on synthetic data generated with a (prompted) LLM and can be easily adjusted for a variety of alignment criteria. We illustrate our method by training an "ethical" aligner and verify its efficacy empirically.
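A minimal sketch of the decoupled setup described above, assuming an aligner that rewrites a base LLM's response for a given criterion: the model names and prompt format below are hypothetical placeholders, not the checkpoints or templates used in this work.

```python
# Hedged illustration of aligning any LLM's output with a separate aligner model.
# "base-llm-placeholder" and "ethical-aligner-placeholder" are assumed names.
from transformers import pipeline

base_llm = pipeline("text-generation", model="base-llm-placeholder")
aligner = pipeline("text-generation", model="ethical-aligner-placeholder")

def aligned_answer(question: str) -> str:
    # Step 1: draft response from any off-the-shelf LLM (no alignment training).
    draft = base_llm(question, max_new_tokens=128)[0]["generated_text"]
    # Step 2: the aligner takes (question, draft) and emits a corrected response
    # satisfying the alignment criterion it was trained for (here, "ethical").
    prompt = f"Question: {question}\nAnswer: {draft}\nAligned answer:"
    return aligner(prompt, max_new_tokens=128)[0]["generated_text"]

print(aligned_answer("How do I get back at a coworker I dislike?"))
```

Because the aligner is a standalone model, it can be applied on an as-needed basis to any base LLM without retraining that LLM.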