Recently, there has been growing interest in leveraging Large Language Models (LLMs) for recommendation, typically by adapting a pre-trained LLM to the recommendation scenario through supervised fine-tuning (SFT). However, neither the pre-training nor the SFT stage explicitly models the comparative relationships among a user's preferences over different items. To construct a "helpful and harmless" LLM-based recommender, we propose Recommendation with smoothing personalized Preference Optimization (RosePO), a general framework that better aligns the model with customized human values during the post-training stage. Specifically, on top of the input and chosen response that naturally align with SFT data, we design a rejected-sampling strategy tailored to enhance helpfulness, along with two strategies aimed at mitigating biases to promote harmlessness. To ensure robustness against the uncertain labels in automatically constructed preference data, we introduce a personalized smoothing factor, predicted by a preference oracle, into the optimization objective. Evaluation on three real-world datasets demonstrates the effectiveness of our method, showing not only improved recommendation performance but also mitigation of semantic hallucination and popularity bias.
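The personalized smoothing factor can be read as a per-example label-smoothing weight in a DPO-style pairwise objective. Below is a minimal PyTorch sketch under that assumption; the function name `rosepo_loss`, the tensor layout, and the value range of the factor are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def rosepo_loss(policy_chosen_logps: torch.Tensor,
                policy_rejected_logps: torch.Tensor,
                ref_chosen_logps: torch.Tensor,
                ref_rejected_logps: torch.Tensor,
                smoothing: torch.Tensor,
                beta: float = 0.1) -> torch.Tensor:
    """Label-smoothed DPO-style objective with a personalized smoothing factor.

    Each *_logps tensor holds the summed log-probability of the chosen or
    rejected response under the policy or the frozen reference model,
    shape (B,). `smoothing` holds the per-example factor eps_u in [0, 0.5),
    e.g. predicted by a preference oracle, shape (B,).
    """
    # Implicit reward margin between the chosen and rejected responses.
    logits = beta * ((policy_chosen_logps - ref_chosen_logps)
                     - (policy_rejected_logps - ref_rejected_logps))
    # Smoothed preference target: with weight (1 - eps_u) the chosen item is
    # preferred; with weight eps_u the automatically constructed label is
    # assumed to be flipped.
    loss = -((1.0 - smoothing) * F.logsigmoid(logits)
             + smoothing * F.logsigmoid(-logits))
    return loss.mean()
```

Setting the smoothing factor to zero recovers a standard DPO loss, so under this reading the personalized factor only softens the training signal on pairs whose automatically constructed labels the oracle deems uncertain.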