Generative Retrieval (GR) is rapidly transforming e-commerce search by replacing traditional multi-stage pipelines with the autoregressive decoding of structured Semantic IDs (SIDs). Despite this architectural efficiency, aligning GR models with nuanced, real-world user preferences remains a critical challenge. While Direct Preference Optimization (DPO) offers an efficient alignment solution, its direct application to structured SIDs suffers from three limitations: (i) it penalizes hierarchical prefixes shared by preferred and rejected items, causing gradient conflicts; (ii) it is vulnerable to noisy pseudo-negatives derived from implicit feedback; and (iii) for multi-label queries with multiple relevant items, it exacerbates a probability "squeezing effect" among valid candidates. To address these issues, we propose RAD-DPO, which introduces token-level gradient detachment to protect prefix structures, similarity-based dynamic reward weighting to mitigate label noise, and a multi-label global contrastive objective combined with a global SFT loss to explicitly expand positive coverage. Extensive offline evaluations and large-scale online A/B testing on JD.com's core search engine show that RAD-DPO achieves significant improvements in both retrieval precision and training efficiency, indicating its robustness in large-scale industrial deployment.
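To make limitation (i) and the prefix-protection idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It computes the standard DPO loss for one preferred/rejected SID pair and returns the gradient with respect to each token log-probability by hand. As one possible reading of "token-level gradient detachment", it assumes that the rejected sequence's tokens inside the shared prefix are treated as constants, so they receive zero gradient. The function names, the example SIDs, and `beta=0.1` are all hypothetical, and the actual RAD-DPO formulation may differ.

```python
import math


def shared_prefix_len(chosen_sid, rejected_sid):
    """Number of leading SID tokens the two items have in common."""
    n = 0
    for a, b in zip(chosen_sid, rejected_sid):
        if a != b:
            break
        n += 1
    return n


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss_and_token_grads(chosen_sid, rejected_sid,
                             logp_c, logp_r, ref_c, ref_r, beta=0.1):
    """Standard DPO loss plus hand-derived per-token gradients.

    logp_* are per-token log-probs under the policy, ref_* under the
    frozen reference model. Assumption (not from the source): rejected
    tokens inside the shared prefix are detached, i.e. get zero gradient.
    """
    margin = beta * ((sum(logp_c) - sum(ref_c)) - (sum(logp_r) - sum(ref_r)))
    loss = -math.log(sigmoid(margin))

    # d loss / d margin = -sigmoid(-margin); each chosen token log-prob
    # contributes +beta to the margin, each rejected one -beta.
    scale = beta * sigmoid(-margin)
    k = shared_prefix_len(chosen_sid, rejected_sid)

    grad_c = [-scale] * len(logp_c)  # descent raises chosen log-probs
    grad_r = [0.0 if t < k else scale  # shared prefix is left untouched
              for t in range(len(logp_r))]
    return loss, grad_c, grad_r


# Hypothetical 3-level SIDs that share their first two levels.
chosen, rejected = [12, 7, 3], [12, 7, 9]
logp = [-0.5, -0.7, -1.2]
loss, grad_c, grad_r = dpo_loss_and_token_grads(
    chosen, rejected, logp, logp, logp, logp)
```

In this example only the third (diverging) token of the rejected SID is penalized, while the two shared prefix tokens are pushed up through the preferred side alone.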