Generative retrieval (GR) has emerged as a transformative paradigm in search and recommender systems, leveraging numeric identifier representations to improve efficiency and generalization. Notably, methods such as TIGER, which employ Residual Quantization-based Semantic Identifiers (RQ-SID), have shown significant promise in e-commerce scenarios by effectively managing item IDs. However, RQ-SID suffers from a critical issue, termed the "\textbf{Hourglass}" phenomenon, in which intermediate codebook tokens become overly concentrated, preventing generative retrieval methods from being used to their full potential. This paper analyzes and addresses the problem, identifying data sparsity and long-tailed distributions as its primary causes. Through comprehensive experiments and detailed ablation studies, we examine the impact of these factors on codebook utilization and data distribution. Our findings show that the "Hourglass" phenomenon substantially affects the performance of RQ-SID in generative retrieval. We propose effective solutions to mitigate this issue, thereby significantly enhancing the effectiveness of generative retrieval in real-world e-commerce applications.
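To make the terms "residual quantization" and "codebook utilization" concrete, the following is a minimal sketch, not the method of TIGER or of this paper. It assumes hand-picked toy codebooks and 2-D embeddings (all values hypothetical), and the helper names `nearest`, `rq_encode`, and `layer_stats` are illustrative. Each layer quantizes the residual left by the previous layer, and the per-layer statistics are the kind of measurement in which an over-concentrated intermediate layer would show up as a low used fraction and a high top-token share. The toy data is not meant to reproduce the phenomenon itself.

```python
from collections import Counter


def nearest(codebook, vec):
    """Index of the codeword with the smallest squared Euclidean distance to vec."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rq_encode(vec, codebooks):
    """Residual quantization: emit one token per layer.

    Each layer picks the codeword nearest to the current residual,
    then subtracts it so the next layer quantizes what is left.
    """
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def layer_stats(sids, layer, codebook_size):
    """Return (fraction of codewords used, share of the most frequent token)."""
    counts = Counter(sid[layer] for sid in sids)
    used = len(counts) / codebook_size
    top_share = max(counts.values()) / len(sids)
    return used, top_share


# Hypothetical toy setup: three layers, four codewords each (assumed values).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.5, 0.0], [0.0, 0.5], [-0.5, 0.0], [0.0, -0.5]],
    [[0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]
items = [[1.5, 0.125], [0.125, 1.25], [-1.75, 0.0], [0.125, -0.75]]

# One semantic ID (a list of three tokens) per item.
sids = [rq_encode(v, codebooks) for v in items]

for layer in range(len(codebooks)):
    used, top_share = layer_stats(sids, layer, codebook_size=4)
    print(f"layer {layer}: used={used:.2f}, top-token share={top_share:.2f}")
```

In practice the codebooks would be learned from item embeddings rather than fixed by hand; the fixed values here only keep the sketch deterministic and dependency-free.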