Interest is growing in next-generation recommender systems (RSs) built on pretrained large language models (LLMs) that exploit their encoded knowledge and reasoning ability. However, the semantic gap between natural language and recommendation tasks remains inadequately addressed, which leads to issues such as spuriously correlated user/item descriptors, ineffective language modeling of user/item content, and inefficient recommendation via auto-regression. In this paper, we propose CLLM4Rec, the first generative RS that tightly integrates the LLM paradigm and the ID paradigm of RS, aiming to address these challenges simultaneously. We first extend the vocabulary of a pretrained LLM with user/item ID tokens to faithfully model user/item collaborative and content semantics. For the pretraining stage, we then propose a novel soft+hard prompting strategy that learns user/item collaborative/content token embeddings via language modeling on RS-specific corpora built from user-item interactions and user/item features. Each document is split into a prompt consisting of heterogeneous soft (user/item) tokens and hard (vocab) tokens, and a main text consisting of homogeneous item tokens or vocab tokens, which facilitates stable and effective language modeling. In addition, a novel mutual regularization strategy encourages CLLM4Rec to capture recommendation-oriented information from user/item content. Finally, we propose a novel recommendation-oriented finetuning strategy in which an item prediction head with multinomial likelihood is added to the pretrained CLLM4Rec backbone to predict hold-out items from soft+hard prompts built from the masked user-item interaction history, so that recommendations of multiple items can be generated efficiently.
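To make the finetuning idea concrete, the following is a minimal sketch of an item prediction head trained with a multinomial likelihood. It is not the authors' implementation. It assumes the backbone has already turned a soft+hard prompt into a single hidden vector, and that the head scores each item by a dot product with an item embedding; the abstract does not specify either. The function names (`item_logits`, `multinomial_nll`, `recommend`) and the toy numbers are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def item_logits(hidden, item_embeddings):
    """Score every item at once (assumed here: dot product with the hidden vector)."""
    return [sum(h * e for h, e in zip(hidden, emb)) for emb in item_embeddings]


def multinomial_nll(hidden, item_embeddings, holdout_items):
    """Negative multinomial log-likelihood of the hold-out items."""
    log_probs = log_softmax(item_logits(hidden, item_embeddings))
    return -sum(log_probs[i] for i in holdout_items)


def recommend(hidden, item_embeddings, seen_items, k):
    """Rank all unseen items from one forward pass; no token-by-token decoding."""
    logits = item_logits(hidden, item_embeddings)
    candidates = [i for i in range(len(logits)) if i not in seen_items]
    return sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]


# Toy data (hypothetical): one hidden vector and four item embeddings.
hidden = [1.0, 0.0]
item_embeddings = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]

loss = multinomial_nll(hidden, item_embeddings, holdout_items=[2])
top2 = recommend(hidden, item_embeddings, seen_items={0}, k=2)  # → [2, 1]
```

Because the head produces a distribution over all items in one pass, a top-k list can be read off directly. This is the efficiency contrast with auto-regressive generation that the abstract draws.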