Generative Retrieval (GR) has emerged as a powerful paradigm in e-commerce search, retrieving items via autoregressive decoding of Semantic IDs (SIDs). However, aligning GR with complex user preferences remains challenging. While Direct Preference Optimization (DPO) offers an efficient alignment solution, its direct application to structured SIDs suffers from three limitations: (i) it penalizes shared hierarchical prefixes, causing gradient conflicts; (ii) it is vulnerable to noisy pseudo-negatives from implicit feedback; and (iii) in multi-label queries with multiple relevant items, it exacerbates a probability "squeezing effect" among valid candidates. To address these issues, we propose RAD-DPO, which introduces token-level gradient detachment to protect prefix structures, similarity-based dynamic reward weighting to mitigate label noise, and a multi-label global contrastive objective integrated with global SFT loss to explicitly expand positive coverage. Extensive offline experiments and online A/B testing on a large-scale e-commerce platform demonstrate significant improvements in ranking quality and training efficiency.
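The token-level gradient detachment idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the reference-model log-ratios of full DPO are omitted for brevity, and per-token log-probabilities are assumed to be given. The key step is detaching the rejected sequence's log-probs on the SID prefix it shares with the chosen sequence, so the DPO loss does not push down probability mass on the common hierarchy.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_prefix_detach(logp_pos, logp_neg, pos_ids, neg_ids, beta=0.1):
    """Simplified DPO-style loss with prefix gradient detachment.

    logp_pos, logp_neg: per-token log-probs of chosen / rejected SID
        sequences under the policy (shape [T]); reference-model terms
        are omitted in this sketch.
    pos_ids, neg_ids: the SID token sequences themselves (shape [T]).
    """
    # Length of the shared hierarchical prefix between chosen and rejected SIDs.
    prefix_len = 0
    for a, b in zip(pos_ids.tolist(), neg_ids.tolist()):
        if a != b:
            break
        prefix_len += 1

    # Detach rejected-side log-probs on the shared prefix: those tokens
    # still contribute to the reward value, but receive no gradient,
    # avoiding the conflict of rewarding and penalizing the same prefix.
    neg_masked = torch.cat([logp_neg[:prefix_len].detach(),
                            logp_neg[prefix_len:]])

    reward_pos = beta * logp_pos.sum()
    reward_neg = beta * neg_masked.sum()
    return -F.logsigmoid(reward_pos - reward_neg)
```

In a usage pass, backpropagating this loss leaves the rejected sequence's shared-prefix tokens with zero gradient while still penalizing the tokens where the two SIDs diverge.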