Fine-grained Semantics Integration for Large Language Model-based Recommendation

Recent advances in Large Language Models (LLMs) have shifted recommendation systems from the discriminative paradigm to the LLM-based generative paradigm, in which the recommender autoregressively generates sequences of semantic identifiers (SIDs) for target items, conditioned on historical interactions. While prevalent LLM-based recommenders have demonstrated performance gains by aligning pretrained LLMs between the language space and the SID space, modeling the SID space still faces two fundamental challenges: (1) Semantically Meaningless Initialization: SID tokens are randomly initialized, which severs the semantic linkage between the SID space and the pretrained language space from the outset; and (2) Coarse-grained Alignment: existing SFT-based alignment tasks focus primarily on item-level optimization and overlook the semantics of individual tokens within SID sequences. To address these challenges, we propose TS-Rec, which integrates Token-level Semantics into LLM-based Recommenders. Specifically, TS-Rec comprises two key components: (1) Semantic-Aware embedding Initialization (SA-Init), which initializes SID token embeddings by applying mean pooling to the pretrained embeddings of keywords extracted by a teacher model; and (2) Token-level Semantic Alignment (TS-Align), which aligns individual tokens within the SID sequence with the shared semantics of the corresponding item clusters. Extensive experiments on two real-world benchmarks demonstrate that TS-Rec consistently outperforms traditional and generative baselines across all standard metrics. These results indicate that integrating fine-grained semantic information significantly enhances the performance of LLM-based generative recommenders.
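
To make the SA-Init idea concrete, the following is a minimal sketch of mean-pooling keyword embeddings into an initial SID token embedding. It is not the authors' implementation: the names `mean_pool` and `sa_init`, the SID token string `<a_12>`, and the toy two-dimensional vectors are hypothetical, and it assumes each keyword maps to a single pretrained vector (in a real LLM a keyword may span several subword tokens, which would themselves need to be pooled).

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise average of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector to pool")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def sa_init(
    sid_keywords: Dict[str, List[str]],
    pretrained: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Hypothetical SA-Init sketch: one initial embedding per SID token.

    sid_keywords maps an SID token to keywords describing its item cluster
    (assumed to come from a teacher model); pretrained maps a keyword to its
    pretrained embedding. Keywords without an embedding are skipped, and SID
    tokens with no usable keyword are left out so the caller can fall back
    to its default initialization.
    """
    init: Dict[str, Vector] = {}
    for sid_token, keywords in sid_keywords.items():
        known = [pretrained[k] for k in keywords if k in pretrained]
        if known:
            init[sid_token] = mean_pool(known)
    return init


# Toy data, purely illustrative.
pretrained = {"wireless": [1.0, 0.0], "headphones": [0.0, 1.0]}
sid_keywords = {"<a_12>": ["wireless", "headphones", "unknown-word"]}
embeddings = sa_init(sid_keywords, pretrained)
```

In this toy setup the SID token starts at the centroid of its keywords' embeddings rather than at a random point, which is the semantic linkage the abstract describes.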
