Self-Supervised Alignment with Mutual Information: Learning to Follow Principles without Preference Labels

When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a model that writes the principles. To avoid dependence on strong models for writing principles, we align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization. Finally, we investigate whether SAMI generalizes to diverse summarization principles (e.g., "summaries should be scientific") and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned and 67% for held-out principles compared to the base model. Our results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
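To make the objective more concrete, the sketch below shows one way a contrastive lower bound on the mutual information between constitutions and responses could be computed for a single query. The abstract does not state which estimator SAMI uses, so the InfoNCE-style symmetric loss here is an assumption, and the names `log_softmax`, `contrastive_mi_loss`, and the matrix `logp` are hypothetical. Entry `logp[i][j]` is assumed to hold the model's log-probability of response `j` (sampled under constitution `j`) when conditioned on constitution `i` and the same query.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def contrastive_mi_loss(logp):
    """Symmetric InfoNCE-style loss over a square log-probability matrix.

    logp[i][j]: assumed log p(response_j | constitution_i, query).
    The diagonal pairs each response with the constitution it was
    generated under; lower loss means the pairing is easier to recover.
    """
    n = len(logp)
    if n == 0 or any(len(row) != n for row in logp):
        raise ValueError("logp must be a non-empty square matrix")

    row_loss = 0.0  # for each constitution, identify its own response
    col_loss = 0.0  # for each response, identify its own constitution
    for i in range(n):
        row_loss -= log_softmax(logp[i])[i]
        column = [logp[r][i] for r in range(n)]
        col_loss -= log_softmax(column)[i]

    return (row_loss + col_loss) / (2 * n)


if __name__ == "__main__":
    # Responses that ignore the constitution: every pairing is equally likely.
    print(contrastive_mi_loss([[0.0, 0.0], [0.0, 0.0]]))
    # Responses that are far more likely under their own constitution.
    print(contrastive_mi_loss([[0.0, -50.0], [-50.0, 0.0]]))
```

Under this assumption, `log(n)` minus the loss is a lower bound on the mutual information for `n` constitutions, so minimizing the loss during finetuning pushes the bound up. A uniform matrix gives a loss of `log(n)` (a bound of zero), while a strongly diagonal matrix drives the loss toward zero.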
