Generative Retrieval (GR) has emerged as a powerful paradigm in e-commerce search, retrieving items via autoregressive decoding of Semantic IDs (SIDs). However, aligning GR with complex user preferences remains challenging. While Direct Preference Optimization (DPO) offers an efficient alignment solution, its direct application to structured SIDs suffers from three limitations: (i) it penalizes shared hierarchical prefixes, causing gradient conflicts; (ii) it is vulnerable to noisy pseudo-negatives from implicit feedback; and (iii) in multi-label queries with multiple relevant items, it exacerbates a probability "squeezing effect" among valid candidates. To address these issues, we propose RAD-DPO, which introduces token-level gradient detachment to protect prefix structures, similarity-based dynamic reward weighting to mitigate label noise, and a multi-label global contrastive objective integrated with global SFT loss to explicitly expand positive coverage. Extensive offline experiments and online A/B testing on a large-scale e-commerce platform demonstrate significant improvements in ranking quality and training efficiency.
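The token-level gradient detachment idea above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name is hypothetical, the reference-model log-ratios of full DPO are omitted for brevity, and per-token log-probabilities are assumed to be given. The key step is detaching the rejected sequence's log-probs on the SID prefix it shares with the chosen sequence, so the DPO loss does not push down probability mass on the common hierarchy.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_prefix_detach(logp_pos, logp_neg, pos_ids, neg_ids, beta=0.1):
    """Simplified DPO-style loss with prefix gradient detachment.

    logp_pos, logp_neg: per-token log-probs of chosen / rejected SID
        sequences under the policy (shape [T]); reference-model terms
        are omitted in this sketch.
    pos_ids, neg_ids: the SID token sequences themselves (shape [T]).
    """
    # Length of the shared hierarchical prefix between chosen and rejected SIDs.
    prefix_len = 0
    for a, b in zip(pos_ids.tolist(), neg_ids.tolist()):
        if a != b:
            break
        prefix_len += 1

    # Detach rejected-side log-probs on the shared prefix: those tokens
    # still contribute to the reward value, but receive no gradient,
    # avoiding the conflict of rewarding and penalizing the same prefix.
    neg_masked = torch.cat([logp_neg[:prefix_len].detach(),
                            logp_neg[prefix_len:]])

    reward_pos = beta * logp_pos.sum()
    reward_neg = beta * neg_masked.sum()
    return -F.logsigmoid(reward_pos - reward_neg)
```

In a usage pass, backpropagating this loss leaves the rejected sequence's shared-prefix tokens with zero gradient while still penalizing the tokens where the two SIDs diverge.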