Human preferences are diverse and dynamic, shaped by regional, cultural, and social factors. Existing alignment methods such as Direct Preference Optimization (DPO) and its variants often default to majority views, overlooking minority opinions and failing to capture the latent user intentions behind prompts. To address these limitations, we introduce \underline{\textbf{A}}daptive \textbf{\underline{I}}ntent-driven \textbf{\underline{P}}reference \textbf{\underline{O}}ptimization (\textbf{A-IPO}). Specifically, A-IPO introduces an intention module that infers the latent intent behind each user prompt and explicitly incorporates this inferred intent into the reward function, encouraging stronger alignment between the model's preferred responses and the user's underlying intentions. We demonstrate, both theoretically and empirically, that incorporating an intention--response similarity term increases the preference margin (by a positive shift of $\lambda\,\Delta\mathrm{sim}$ in the log-odds), yielding a clearer separation between preferred and dispreferred responses than DPO. For evaluation, we introduce two new benchmarks, Real-pref and Attack-pref, along with an extended version of an existing dataset, GlobalOpinionQA-Ext, to assess real-world and adversarial preference alignment. By explicitly modeling diverse user intents, A-IPO facilitates pluralistic preference optimization while simultaneously enhancing adversarial robustness in preference alignment. Comprehensive empirical evaluation demonstrates that A-IPO consistently surpasses existing baselines, yielding substantial improvements across key metrics: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.
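To make the stated margin shift concrete, the following is a minimal sketch; the notation $\iota_x$ for the inferred intent, the similarity function $\mathrm{sim}(\cdot,\cdot)$, and the weight $\lambda$ as written here are illustrative assumptions, since the abstract does not fix their exact form. Writing $y_w$ and $y_l$ for the preferred and dispreferred responses to a prompt $x$, DPO's implicit reward margin and its A-IPO-style augmentation would read
\begin{align*}
m_{\mathrm{DPO}}(x) &= \beta\left[\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right],\\[4pt]
m_{\mathrm{A\text{-}IPO}}(x) &= m_{\mathrm{DPO}}(x) + \lambda\bigl[\mathrm{sim}(\iota_x, y_w) - \mathrm{sim}(\iota_x, y_l)\bigr] = m_{\mathrm{DPO}}(x) + \lambda\,\Delta\mathrm{sim}.
\end{align*}
Under this sketch, whenever the inferred intent is more similar to the preferred response ($\Delta\mathrm{sim} > 0$), the log-odds margin grows by exactly $\lambda\,\Delta\mathrm{sim}$, which is the separation gain over DPO claimed above.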