Constitutional alignment aims to align large language models (LLMs) with value-laden principles written in natural language (e.g., ``avoid using biased language''). Prior work has focused on parameter fine-tuning techniques, such as reinforcement learning from human feedback (RLHF), to instill these principles; however, such approaches are computationally demanding, require careful engineering and tuning, and often depend on human annotation data that is difficult to obtain. We propose \textsc{reflect}, an inference-time framework for constitutional alignment that requires no training or data, providing a plug-and-play approach for aligning an instruction-tuned model to a set of principles. \textsc{reflect} operates entirely in-context, combining (i) constitution-conditioned base-response generation with post-generation (ii) self-evaluation, (iii) self-critique, and (iv) final revision. This explicit in-context reasoning over principles after generation outperforms standard few-shot prompting and yields transparent reasoning traces. Our results demonstrate that \textsc{reflect} significantly improves LLM conformance to diverse and complex principles, including principles quite distinct from those emphasized in the model's original parameter fine-tuning, without sacrificing factual reasoning. \textsc{reflect} is particularly effective at reducing rare but severe principle violations, improving safety and robustness in the tail of the generation distribution. Finally, we show that \textsc{reflect} naturally generates useful training data for traditional parameter fine-tuning, enabling efficient scaling and reducing inference-time computational overhead in long-term deployments.
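To make the four-stage pipeline concrete, the following is a minimal sketch of the \textsc{reflect} loop in Python. It assumes only a generic \texttt{generate(prompt)} wrapper around any instruction-tuned chat model; the function name, the sample constitution, and the prompt wording are illustrative placeholders, not the paper's exact prompts.

\begin{verbatim}
# Minimal sketch of the REFLECT loop, assuming a generic
# generate(prompt: str) -> str wrapper around an instruction-tuned
# chat model. CONSTITUTION and all prompt text are illustrative.

CONSTITUTION = "\n".join([
    "1. Avoid using biased language.",
    "2. Do not provide medical diagnoses.",
])

def reflect(user_query: str, generate) -> str:
    # (i) Constitution-conditioned base response.
    base = generate(
        f"Principles:\n{CONSTITUTION}\n\n"
        f"Answer the user while following the principles.\n"
        f"User: {user_query}"
    )
    # (ii) Self-evaluation: does the draft violate any principle?
    verdict = generate(
        f"Principles:\n{CONSTITUTION}\n\nDraft answer:\n{base}\n\n"
        "Does the draft violate any principle? Reply YES or NO."
    )
    if "YES" not in verdict.upper():
        return base  # No violation detected; keep the base response.
    # (iii) Self-critique: name the violated principles and explain how.
    critique = generate(
        f"Principles:\n{CONSTITUTION}\n\nDraft answer:\n{base}\n\n"
        "List each violated principle and explain the violation."
    )
    # (iv) Final revision conditioned on the critique.
    return generate(
        f"Principles:\n{CONSTITUTION}\n\nDraft answer:\n{base}\n\n"
        f"Critique:\n{critique}\n\n"
        "Rewrite the draft so it fully complies with the principles."
    )
\end{verbatim}

Because every stage is an ordinary in-context call, the critique and verdict strings double as the transparent reasoning traces mentioned above, and the (draft, critique, revision) triples are exactly the data that can later seed parameter fine-tuning.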