When prompting a language model (LM), users often expect the model to adhere to a set of behavioral principles across diverse tasks, such as producing insightful content while avoiding harmful or biased language. Instilling such principles (i.e., a constitution) into a model is resource-intensive, technically challenging, and generally requires human preference labels or examples. We introduce SAMI, an iterative algorithm that finetunes a pretrained language model (without requiring preference labels or demonstrations) to increase the conditional mutual information between constitutions and self-generated responses given queries from a dataset. On single-turn dialogue and summarization, a SAMI-trained mistral-7b outperforms the initial pretrained model, with win rates between 66% and 77%. Strikingly, it also surpasses an instruction-finetuned baseline (mistral-7b-instruct) with win rates between 55% and 57% on single-turn dialogue. SAMI requires a model that writes the principles. To avoid dependence on strong models for writing principles, we align a strong pretrained model (mixtral-8x7b) using constitutions written by a weak instruction-finetuned model (mistral-7b-instruct), achieving a 65% win rate on summarization. Finally, we investigate whether SAMI generalizes to diverse summarization principles (e.g., "summaries should be scientific") and scales to stronger models (llama3-70b), finding that it achieves win rates of up to 68% for learned and 67% for held-out principles compared to the base model. Our results show that a pretrained LM can learn to follow constitutions without using preference labels, demonstrations, or human oversight.
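The abstract does not say how the conditional mutual information is estimated, so the following is only a minimal sketch under one assumption: that a contrastive, InfoNCE-style lower bound is used. For a single query, `logp[i][j]` is assumed to hold the model's log-probability of response `j` under constitution `i`, where response `j` was sampled under constitution `j`. The function names and the matrix layout are hypothetical and chosen for illustration; they are not taken from the source.

```python
import math


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


def contrastive_mi_loss(logp):
    """Symmetric InfoNCE-style loss on an n x n log-probability matrix.

    Assumption: logp[i][j] = log p(response_j | constitution_i, query),
    with response_j sampled from the model under constitution_j.
    """
    n = len(logp)
    # Row direction: given constitution i, identify its own response.
    row = -sum(log_softmax(logp[i])[i] for i in range(n)) / n
    # Column direction: given response j, identify its own constitution.
    cols = [[logp[i][j] for i in range(n)] for j in range(n)]
    col = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (row + col)


def mi_lower_bound(logp):
    """InfoNCE-style estimate in nats: log(n) minus the contrastive loss."""
    return math.log(len(logp)) - contrastive_mi_loss(logp)


# Responses that ignore the constitution: every entry is equally likely.
uninformative = [[-5.0, -5.0], [-5.0, -5.0]]
# Responses that are far more likely under their own constitution.
aligned = [[-1.0, -9.0], [-9.0, -1.0]]

print(mi_lower_bound(uninformative))
print(mi_lower_bound(aligned))
```

In this sketch, minimizing the loss raises the estimate, which is capped at log(n) for n constitutions. When responses carry no information about the constitution, the estimate is approximately zero; when each response is much more likely under its own constitution, it approaches the cap.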