In large language model (LLM)-based recommendation systems, direct preference optimization (DPO) effectively aligns recommendations with user preferences, and multi-negative objective functions are needed to leverage abundant implicit-feedback negatives and sharpen preference boundaries. However, our empirical analyses reveal a counterintuitive phenomenon, which we call preference optimization collapse: increasing the number of negative samples can degrade performance even though the training loss continues to decrease. We further demonstrate theoretically that this collapse arises from gradient suppression, in which easily discriminable negatives dominate the boundary-critical negatives that truly define user preference boundaries. As a result, boundary-relevant signals are under-optimized, weakening the model's decision boundary. Motivated by these observations, we propose DynamicPO (Dynamic Preference Optimization), a lightweight, plug-and-play framework comprising two adaptive mechanisms: Dynamic Boundary Negative Selection, which identifies and prioritizes informative negatives near the model's decision boundary, and Dual-Margin Dynamic β Adjustment, which calibrates optimization strength per sample according to boundary ambiguity. Extensive experiments on three public datasets show that DynamicPO effectively prevents optimization collapse and improves the recommendation accuracy of multi-negative preference optimization methods, with negligible computational overhead. Our code and datasets are available at https://github.com/xingyuHuxingyu/DynamicPO.
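The gradient-suppression argument can be illustrated with a toy calculation. The sketch below is not the DynamicPO implementation and does not reproduce the paper's derivation. It assumes one common multi-negative form in which the implicit rewards of the negatives are averaged, loss = -log σ(β (r_pos - mean(r_negs))), and it uses a hypothetical top-k rule, `select_boundary_negatives`, as a stand-in for boundary-aware selection. All function names and reward values are made up for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def averaged_multi_negative_loss(r_pos, r_negs, beta=1.0):
    """Toy averaged multi-negative preference loss (an assumed form).

    loss = -log sigmoid(beta * (r_pos - mean(r_negs)))

    Returns the loss and the gradient magnitude that each individual
    negative receives, which is beta * sigmoid(-margin) / K.
    """
    k = len(r_negs)
    margin = beta * (r_pos - sum(r_negs) / k)
    loss = -math.log(sigmoid(margin))
    grad_per_negative = beta * sigmoid(-margin) / k
    return loss, grad_per_negative


def select_boundary_negatives(r_negs, k):
    """Hypothetical selection rule: keep the k negatives with the highest
    implicit reward, i.e. the ones closest to the positive item."""
    return sorted(r_negs, reverse=True)[:k]


if __name__ == "__main__":
    r_pos = 0.5            # implicit reward of the preferred item (made up)
    boundary_neg = 0.4     # a hard negative close to the positive (made up)
    easy_neg = -5.0        # an easily discriminable negative (made up)

    for num_negs in (1, 2, 4, 8):
        negs = [boundary_neg] + [easy_neg] * (num_negs - 1)
        loss, grad = averaged_multi_negative_loss(r_pos, negs)
        print(f"K={num_negs}: loss={loss:.4f}, grad on boundary negative={grad:.5f}")

    # Restricting the objective to the boundary negative restores its gradient.
    negs = [boundary_neg] + [easy_neg] * 7
    kept = select_boundary_negatives(negs, 1)
    loss, grad = averaged_multi_negative_loss(r_pos, kept)
    print(f"after selection: loss={loss:.4f}, grad on boundary negative={grad:.5f}")
```

Under these assumptions, adding easy negatives widens the average margin, so the loss keeps falling while the gradient reaching the boundary negative shrinks. Keeping only the negatives nearest the positive item undoes this in the toy setting, which is the intuition behind prioritizing boundary negatives.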