Direct Preference Optimization (DPO) guides large language models (LLMs) to generate recommendations aligned with the distribution of users' historical behavior by minimizing a preference alignment loss. However, our systematic empirical study and theoretical analysis reveal that DPO tends to amplify spurious correlations caused by environmental confounders during alignment, significantly undermining the generalization capability of LLM-based generative recommendation methods in out-of-distribution (OOD) scenarios. To mitigate this issue, we propose CausalDPO, an extension of DPO that incorporates a causal invariance learning mechanism. The method introduces a backdoor adjustment strategy during the preference alignment phase to eliminate interference from environmental confounders, explicitly models the latent environmental distribution using a soft clustering approach, and enforces consistency across diverse environments through invariance constraints. Theoretical analysis demonstrates that CausalDPO can effectively capture users' stable preference structures across multiple environments, thereby improving the OOD generalization performance of LLM-based recommendation models. We conduct extensive experiments under four representative distribution shift settings to validate the effectiveness of CausalDPO, which achieves an average performance improvement of 17.17% across four evaluation metrics.
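To make the ingredients named in the abstract concrete, the following is a minimal sketch, not the authors' objective. `dpo_loss` is the standard per-pair DPO loss. `env_invariant_objective` is a hypothetical illustration: it assumes each preference pair already carries soft environment-membership probabilities (`env_probs`, standing in for the output of a soft clustering step) and swaps in a simple variance-across-environments penalty as the invariance constraint. The function names, the parameter `lam`, and the choice of penalty are assumptions, and the abstract does not specify how CausalDPO implements backdoor adjustment or its constraints.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard per-pair DPO loss: -log(sigmoid(beta * margin)).

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the policy and the frozen reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def env_invariant_objective(pair_losses, env_probs, lam=1.0):
    """Hypothetical invariance-regularized objective (an assumption, not CausalDPO).

    pair_losses: per-pair DPO losses.
    env_probs:   per-pair soft assignments over environments (rows sum to 1).
    Returns (objective, per-environment risks).
    """
    num_envs = len(env_probs[0])
    risks = []
    for e in range(num_envs):
        total_weight = sum(p[e] for p in env_probs)
        weighted_loss = sum(p[e] * loss for p, loss in zip(env_probs, pair_losses))
        # Soft-assignment-weighted average loss within environment e.
        risks.append(weighted_loss / total_weight if total_weight > 0 else 0.0)
    # Uniform average over environments, plus a penalty on how much the
    # per-environment risks disagree (a variance penalty, assumed here).
    mean_risk = sum(risks) / num_envs
    variance = sum((r - mean_risk) ** 2 for r in risks) / num_envs
    return mean_risk + lam * variance, risks
```

In this sketch the penalty is zero when every environment has the same risk, so the optimizer is discouraged from lowering the loss in one environment at the expense of another. That is one common way to express an invariance constraint; other formulations are equally plausible readings of the abstract.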