Large Language Models (LLMs) have exhibited remarkable performance across a wide range of domains, motivating research into their potential for recommendation systems. Early efforts leverage LLMs' rich knowledge and strong generalization via in-context learning, framing recommendation tasks as prompts. However, LLM performance in recommendation scenarios remains limited by the mismatch between their pretraining objectives and recommendation tasks, as well as the absence of recommendation-specific data during pretraining. To address these challenges, we propose DPO4Rec, a novel framework that integrates Direct Preference Optimization (DPO) into LLM-enhanced recommendation systems. First, we prompt the LLM to infer user preferences from historical interactions and use these preferences to augment traditional ID-based sequential recommendation models. Next, we train a reward model on knowledge-augmented recommendation architectures to assess the quality of LLM-generated reasoning. Using this reward model, we select the highest- and lowest-ranked of N sampled responses to construct a preference dataset for LLM fine-tuning. Finally, we apply a structure alignment strategy via DPO to align the LLM's outputs with desirable recommendation behavior. Extensive experiments show that DPO4Rec significantly improves re-ranking performance over strong baselines, demonstrating the enhanced instruction-following capability of LLMs in recommendation tasks.
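To make the preference-construction and alignment steps concrete, the following is a minimal PyTorch sketch of best-of-N pair selection followed by the standard DPO objective. The function names and the reward-model scores (`rewards`) are illustrative assumptions rather than the paper's actual interface, and the sequence log-probabilities are assumed to be computed under the fine-tuned policy and a frozen reference copy of the LLM.

```python
import torch.nn.functional as F

def build_preference_pairs(prompts, responses, rewards):
    """For each prompt, keep the highest- and lowest-reward response
    among its N samples as a (chosen, rejected) pair."""
    pairs = []
    for x, ys, rs in zip(prompts, responses, rewards):
        best = max(range(len(ys)), key=lambda i: rs[i])
        worst = min(range(len(ys)), key=lambda i: rs[i])
        pairs.append((x, ys[best], ys[worst]))
    return pairs

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective: push the policy's log-ratio for the
    chosen response above that of the rejected one, measured
    relative to the frozen reference model."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

The frozen reference term is what keeps the fine-tuned policy close to the original LLM, letting DPO align outputs with the reward model's preferences without a separate reinforcement-learning loop.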