A semantic identifier (ID) is an important concept in information retrieval: it aims to preserve the semantics of objects such as documents and items inside their IDs. Previous studies typically adopt a two-stage pipeline to learn semantic IDs, first obtaining embeddings from off-the-shelf text encoders and then deriving IDs from those embeddings. However, each step introduces potential information loss, and there is usually an inherent mismatch between the distribution of embeddings in the latent space produced by text encoders and the distribution required for semantic indexing. Designing a method that learns a document's semantic representation and its hierarchical structure simultaneously is non-trivial, because semantic IDs are discrete and sequentially structured and semantic supervision is scarce. In this paper, we introduce LMIndexer, a self-supervised framework that learns semantic IDs with a generative language model. We tackle the challenge of sequential discrete IDs by introducing a semantic indexer capable of generating neural sequential discrete representations, trained with progressive training and contrastive learning. To address the lack of semantic supervision, we train the model with a self-supervised document reconstruction objective. We show the high quality of the learned IDs and demonstrate their effectiveness on three tasks (recommendation, product search, and document retrieval) across five datasets from various domains. Code is available at https://github.com/PeterGriffinJin/LMIndexer.
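To make the two-stage pipeline that the abstract argues against more concrete, the following is a minimal sketch of one possible second stage: deriving sequential discrete IDs from precomputed embeddings by recursive k-means clustering. This is an illustration only, not LMIndexer and not any specific prior method. The function names `kmeans` and `semantic_ids`, the parameters `k` and `depth`, and the random vectors standing in for text-encoder output are all assumptions introduced here.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (hypothetical helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=np.int64)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def semantic_ids(embeddings, k=4, depth=3, seed=0):
    """Assign each embedding a length-`depth` ID with codes in [0, k)."""
    embeddings = np.asarray(embeddings, dtype=float)
    n = len(embeddings)
    ids = np.zeros((n, depth), dtype=np.int64)

    def split(indices, level):
        if level == depth or len(indices) == 0:
            return
        if len(indices) <= k:
            # Too few points to cluster: give each its own code at this
            # level and leave the deeper positions at 0.
            ids[indices, level] = np.arange(len(indices))
            return
        _, labels = kmeans(embeddings[indices], k, seed=seed + level)
        ids[indices, level] = labels
        # Recurse into each cluster to produce the next ID position.
        for j in range(k):
            split(indices[labels == j], level + 1)

    split(np.arange(n), 0)
    return ids


if __name__ == "__main__":
    # Random vectors stand in for embeddings from an off-the-shelf encoder.
    fake_embeddings = np.random.default_rng(0).normal(size=(200, 16))
    print(semantic_ids(fake_embeddings, k=4, depth=3)[:5])
```

In this sketch the encoder and the clustering step never influence each other, so any structure the encoder fails to capture cannot be recovered when the IDs are assigned. That decoupling is the kind of information loss and distribution mismatch the abstract describes as motivation for learning the IDs jointly with a language model.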