There is growing interest in utilizing large language models (LLMs) to advance next-generation Recommender Systems (RecSys), driven by their outstanding language understanding and in-context learning capabilities. In this scenario, tokenizing (i.e., indexing) users and items becomes essential for seamlessly aligning LLMs with recommendation tasks. While several studies have made progress in representing users and items through textual content or latent representations, challenges remain in efficiently capturing high-order collaborative knowledge in discrete tokens that are compatible with LLMs. Additionally, the majority of existing tokenization approaches struggle to generalize effectively to new/unseen users or items that were not in the training corpus. To address these challenges, we propose a novel framework called TokenRec, which introduces not only an effective ID tokenization strategy but also an efficient retrieval paradigm for LLM-based recommendations. Specifically, our tokenization strategy, the Masked Vector-Quantized (MQ) Tokenizer, quantizes the masked user/item representations learned from collaborative filtering into discrete tokens, thus achieving a smooth incorporation of high-order collaborative knowledge and a generalizable tokenization of users and items for LLM-based RecSys. Meanwhile, our generative retrieval paradigm is designed to efficiently recommend top-$K$ items for users, eliminating the need for the time-consuming auto-regressive decoding and beam search processes used by LLMs and thus significantly reducing inference time. Comprehensive experiments validate the effectiveness of the proposed methods, demonstrating that TokenRec outperforms competitive benchmarks, including both traditional recommender systems and emerging LLM-based recommender systems.
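The core idea of the MQ Tokenizer can be illustrated with a minimal sketch: mask part of a collaborative-filtering embedding, then map the masked vector to its nearest codebook entry, yielding a discrete token an LLM can consume. The codebook size, embedding dimension, and masking ratio below are illustrative assumptions, not the paper's actual configuration, and the single-codebook nearest-neighbor lookup is a simplification of the full tokenizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: 32 discrete codes, 8-dimensional embeddings.
codebook = rng.normal(size=(32, 8))
user_emb = rng.normal(size=(8,))  # e.g., a user embedding from collaborative filtering

# Randomly mask a fraction of the embedding dimensions (zeroing them out),
# so quantization must tolerate partial views of the representation --
# this is what supports generalization to unseen users/items.
keep = rng.random(8) > 0.25
masked_emb = np.where(keep, user_emb, 0.0)

# Quantize: assign the nearest codebook entry by Euclidean distance.
distances = np.linalg.norm(codebook - masked_emb, axis=1)
token_id = int(np.argmin(distances))  # the discrete token fed to the LLM
print(token_id)
```

In practice, multiple codebooks (one per quantization level) would produce a short token sequence per user/item rather than a single ID, but the masked nearest-neighbor assignment above captures the essential step.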