Recently, there has been growing interest in developing next-generation recommender systems (RSs) based on pretrained large language models (LLMs) that fully utilize their encoded knowledge and reasoning ability. However, the semantic gap between natural language and recommendation tasks is still not well addressed, leading to issues such as spuriously correlated user/item descriptors, ineffective language modeling on user/item contents, and inefficient recommendations via auto-regression. In this paper, we propose CLLM4Rec, the first generative RS that tightly integrates the LLM paradigm and the ID paradigm of RS, aiming to address the above challenges simultaneously. We first extend the vocabulary of pretrained LLMs with user/item ID tokens to faithfully model user/item collaborative and content semantics. Accordingly, in the pretraining stage, we propose a novel soft+hard prompting strategy to effectively learn user/item collaborative/content token embeddings via language modeling on RS-specific corpora established from user-item interactions and user/item features. Each document is split into a prompt consisting of heterogeneous soft (user/item) tokens and hard (vocab) tokens, and a main text consisting of homogeneous item tokens or vocab tokens, which facilitates stable and effective language modeling. In addition, a novel mutual regularization strategy is introduced to encourage CLLM4Rec to capture recommendation-oriented information from user/item contents. Finally, we propose a novel recommendation-oriented finetuning strategy for CLLM4Rec, in which an item prediction head with multinomial likelihood is added to the pretrained CLLM4Rec backbone to predict hold-out items based on soft+hard prompts established from masked user-item interaction history, so that recommendations of multiple items can be generated efficiently.
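To make the finetuning idea concrete, the following is a minimal sketch of an item prediction head trained with a multinomial likelihood, which scores every item in one forward pass instead of generating item tokens one at a time. It is an illustration under stated assumptions, not the CLLM4Rec implementation: the dot-product head, the function names (`item_logits`, `multinomial_nll`, `recommend`), and the toy vectors are all hypothetical, and the `hidden` vector merely stands in for whatever representation the pretrained backbone produces for a soft+hard prompt.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def item_logits(hidden, item_embeddings):
    """Assumed head: one dot product per item between the backbone's
    hidden state and that item's embedding (a hypothetical choice)."""
    return [sum(h * e for h, e in zip(hidden, emb)) for emb in item_embeddings]


def multinomial_nll(logits, held_out_items):
    """Negative multinomial log-likelihood of the held-out item indices.

    All items share a single softmax, so the loss is the negated sum of
    the log-probabilities of the held-out items."""
    log_probs = log_softmax(logits)
    return -sum(log_probs[i] for i in held_out_items)


def recommend(logits, seen_items, k):
    """Rank all unseen items from one set of logits and return the top k."""
    candidates = [i for i in range(len(logits)) if i not in seen_items]
    return sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]


if __name__ == "__main__":
    # Toy stand-ins: a 2-d hidden state and four item embeddings.
    hidden = [1.0, 0.0]
    item_embeddings = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]

    logits = item_logits(hidden, item_embeddings)
    # Item 0 is in the (masked) history; item 2 is held out for training.
    loss = multinomial_nll(logits, held_out_items=[2])
    top_items = recommend(logits, seen_items={0}, k=2)
    print("loss:", loss)
    print("top items:", top_items)
```

Because a single softmax covers the whole item set, a top-k list for a user is obtained by sorting one vector of logits, which is the sense in which multiple items can be recommended efficiently.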