There is a growing need for pluralistic alignment methods that can steer language models toward individual attributes and preferences. One such method, Self-Supervised Alignment with Mutual Information (SAMI), uses a conditional mutual information objective to strengthen the connection between behavioral preferences and model responses. We conduct two experiments exploring SAMI in multi-task settings. First, we compare SAMI to Direct Preference Optimization (DPO) on a multi-task benchmark (MT-Bench), using a stronger model to generate training data for a weaker one across diverse categories (humanities, STEM, extraction, coding, math, reasoning, and roleplay). Our results indicate that one iteration of SAMI achieves a 57% win rate against DPO, with significant variation in performance across task categories. Second, we examine SAMI's impact on mathematical accuracy (GSM-8K) relative to supervised fine-tuning (SFT). While SAMI increases zero-shot performance by 1.1%, SFT is more effective with a 3.2% boost. However, SAMI exhibits interesting scaling behavior: when given 10 attempts, SAMI improves accuracy by 3.9%, while SFT achieves a 10.1% increase. Combining SAMI with SFT yields an additional improvement of 1.3% in multi-attempt settings, though single-attempt accuracy remains unchanged.
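To make the objective concrete, SAMI can be read as optimizing a contrastive lower bound on the conditional mutual information between a batch of behavioral preferences (principles or constitutions) and the responses generated under them. The sketch below is a minimal illustration of such a symmetric, InfoNCE-style bound, assuming the model's per-pair log-likelihoods have already been collected into a matrix; the function name `sami_loss` and this packaging are our own and are not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def sami_loss(logprob_matrix: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE-style lower bound on the conditional mutual
    information between principles (rows) and responses (columns).

    logprob_matrix[i, j] holds the model's total log-probability of
    response j when conditioned on principle i (and the shared prompt);
    matched principle/response pairs lie on the diagonal.
    """
    n = logprob_matrix.size(0)
    targets = torch.arange(n, device=logprob_matrix.device)
    # Row-wise: pick out the matching response for each principle.
    loss_rows = F.cross_entropy(logprob_matrix, targets)
    # Column-wise: pick out the matching principle for each response.
    loss_cols = F.cross_entropy(logprob_matrix.t(), targets)
    return 0.5 * (loss_rows + loss_cols)

# Toy usage: a 4x4 batch where matched pairs (the diagonal) are most likely.
scores = torch.randn(4, 4) + 5.0 * torch.eye(4)
print(sami_loss(scores))
```

Minimizing this loss pushes each response to be most likely under the principle it was generated from, which is one standard way to realize a mutual-information-based alignment objective.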
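The multi-attempt numbers above correspond to sampling several completions per GSM-8K problem and counting the problem as solved if any attempt reaches the reference answer. The following is a minimal sketch of that evaluation, assuming exact-match grading of extracted final answers; the helper name and toy data are illustrative only.

```python
def multi_attempt_accuracy(attempts_per_problem, references):
    """Fraction of problems solved when the model gets several tries.

    attempts_per_problem[i] is a list of predicted final answers for
    problem i; references[i] is the gold answer. A problem counts as
    solved if any of its attempts matches the reference exactly.
    """
    solved = sum(
        any(pred == ref for pred in attempts)
        for attempts, ref in zip(attempts_per_problem, references)
    )
    return solved / len(references)

# Toy usage with 3 attempts per problem on two problems.
preds = [["42", "41", "42"], ["7", "8", "9"]]
golds = ["42", "10"]
print(multi_attempt_accuracy(preds, golds))  # 0.5
```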