Preference tuning aligns pretrained language models with human judgments of quality, helpfulness, or safety by optimizing over explicit preference signals rather than likelihood alone. Prior work has shown that preference tuning degrades performance and reduces helpfulness when models are evaluated outside the training domain. However, the extent to which adaptation strategies mitigate this degradation under domain shift remains unexplored. We address this challenge by conducting a comprehensive, systematic study of alignment generalization under domain shift. We compare five popular alignment objectives and several source-to-target adaptation strategies, including target-domain supervised fine-tuning and pseudo-labeling, on summarization and question-answering helpfulness tasks. Our findings reveal systematic differences in generalization across alignment objectives under domain shift. We show that adaptation strategies based on pseudo-labeling can substantially reduce domain-shift degradation.