Semantic identifiers (SIDs) have gained increasing attention in generative retrieval (GR) for recommendation because of their meaningful semantic discriminability. However, current studies in this field (1) offer limited investigation into construction strategies for better SIDs, and (2) typically rely on costly GR training to assess SIDs. To address these challenges, we propose FORGE, a comprehensive benchmark for FOrming semantic identifieRs for Generative rEtrieval. Specifically, FORGE provides a taxonomy of the SID construction process from several perspectives and validates the impact of each construction choice on downstream GR through offline experiments across diverse settings. Notably, applying these empirical findings yielded a 0.35% increase in transaction count in online A/B experiments in the "Guess You Like" section of Taobao. The corresponding SID construction strategies have since been deployed at full scale on Taobao, demonstrating their practical effectiveness. To avoid expensive SID assessment that requires full GR training, we propose two novel SID evaluation metrics that are highly correlated with recommendation performance, enabling convenient evaluation without any GR training. Furthermore, to support the community, we release AL-GR, the industrial dataset used in our experiments, which comprises 14 billion interactions and 250 million items, together with the corresponding multimodal features, collected from Taobao. All code and data are available at https://github.com/selous123/al_sid.