Human preferences are diverse and dynamic, shaped by regional, cultural, and social factors. Existing alignment methods such as Direct Preference Optimization (DPO) and its variants often default to majority views, overlooking minority opinions and failing to capture the latent user intentions behind prompts. To address these limitations, we introduce \underline{\textbf{A}}daptive \textbf{\underline{I}}ntent-driven \textbf{\underline{P}}reference \textbf{\underline{O}}ptimization (\textbf{A-IPO}). Specifically, A-IPO introduces an intention module that infers the latent intent behind each user prompt and explicitly incorporates this inferred intent into the reward function, encouraging stronger alignment between the model's preferred responses and the user's underlying intentions. We demonstrate, both theoretically and empirically, that incorporating an intention--response similarity term increases the preference margin (by a positive shift of $\lambda\,\Delta\mathrm{sim}$ in the log-odds), yielding a clearer separation between preferred and dispreferred responses than DPO. For evaluation, we introduce two new benchmarks, Real-pref and Attack-pref, along with an extended version of an existing dataset, GlobalOpinionQA-Ext, to assess real-world and adversarial preference alignment. By explicitly modeling diverse user intents, A-IPO enables pluralistic preference optimization while simultaneously strengthening adversarial robustness in preference alignment. Comprehensive empirical evaluation shows that A-IPO consistently surpasses existing baselines, yielding substantial improvements on key metrics: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.
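To make the claimed margin shift concrete, the sketch below illustrates how an additive intention--response similarity term propagates into the DPO log-odds. The notation is ours for illustration only: the inferred-intent symbol $\iota(x)$, the similarity function $\mathrm{sim}$, the weight $\lambda$, and the exact additive form of the reward augmentation are assumptions not fixed by the abstract.
\begin{align}
r_{\mathrm{DPO}}(x,y) &= \beta \log \frac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}, \\
r_{\text{A-IPO}}(x,y) &= r_{\mathrm{DPO}}(x,y) + \lambda\,\mathrm{sim}\bigl(\iota(x),\, y\bigr), \\
r_{\text{A-IPO}}(x,y_w) - r_{\text{A-IPO}}(x,y_l)
&= \bigl[r_{\mathrm{DPO}}(x,y_w) - r_{\mathrm{DPO}}(x,y_l)\bigr]
 + \lambda \underbrace{\bigl[\mathrm{sim}(\iota(x), y_w) - \mathrm{sim}(\iota(x), y_l)\bigr]}_{\Delta\mathrm{sim}}.
\end{align}
Whenever the preferred response $y_w$ is more consistent with the inferred intent than the dispreferred response $y_l$, we have $\Delta\mathrm{sim} > 0$, so the preference margin widens by exactly $\lambda\,\Delta\mathrm{sim}$, matching the positive log-odds shift stated above.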