Post-training methods such as supervised fine-tuning (SFT) and preference optimization typically align language models toward a single global assistant behavior. While effective for improving average helpfulness, this can suppress the natural variation of human responses across languages, tasks, and dialogue settings. We study this problem as conditional human-distribution alignment: models should match the human response distribution appropriate to the current interaction context, rather than a universal response style. We introduce PolyAlign, a distribution-aware alignment framework that organizes bilingual interaction data into bucket-specific human reference distributions defined by language, interaction track, response family, and length. PolyAlign combines Bucket-Aware SFT, which balances optimization across heterogeneous buckets, with Human-Distribution Preference Optimization (HDPO), which regularizes preference learning using critic-estimated distance to bucket-specific human support. Across a bilingual evaluation suite covering English and Chinese single- and multi-turn settings, PolyAlign improves conditional naturalness and distributional faithfulness while preserving competitive task utility. The results suggest that post-training should move beyond global alignment objectives toward interaction-aware alignment with human response distributions.
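To make the bucket structure concrete, the following is a minimal sketch of how examples might be assigned to buckets keyed by language, interaction track, response family, and length, and how per-example weights could be chosen so that every non-empty bucket contributes equally to an SFT loss. The abstract does not specify the balancing rule, so equal-mass weighting is an assumption here, as are the field names, the length thresholds, and the helper names `bucket_of` and `balanced_weights`.

```python
from __future__ import annotations

from collections import Counter
from typing import NamedTuple


class Bucket(NamedTuple):
    """Bucket key with the four axes named in the abstract."""
    language: str    # e.g. "en" or "zh"
    track: str       # e.g. "single_turn" or "multi_turn"
    family: str      # response family label (hypothetical values)
    length_bin: str  # coarse length bin


def length_bin(n_tokens: int) -> str:
    # Hypothetical thresholds; the source does not state how length is binned.
    if n_tokens < 32:
        return "short"
    if n_tokens < 256:
        return "medium"
    return "long"


def bucket_of(example: dict) -> Bucket:
    """Map one example (assumed dict layout) to its bucket key."""
    return Bucket(
        example["language"],
        example["track"],
        example["family"],
        length_bin(example["n_tokens"]),
    )


def balanced_weights(examples: list[dict]) -> list[float]:
    """Per-example weights giving each non-empty bucket equal total weight.

    Assumed rule: weight = N / (B * count(bucket)), where N is the number
    of examples and B the number of non-empty buckets. The weights sum to N,
    so the average weight is 1 and the overall loss scale is unchanged.
    """
    keys = [bucket_of(e) for e in examples]
    counts = Counter(keys)
    n, b = len(examples), len(counts)
    return [n / (b * counts[k]) for k in keys]
```

In a training loop, these weights would multiply the per-example token loss before averaging, so that a small bucket (for instance, Chinese multi-turn responses) is not drowned out by a large one.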