Reinforced Preference Optimization for Reasoning-Augmented Recommendations

Recommender systems are critical for delivering personalized content across digital platforms, and recent advances in Large Language Models (LLMs) offer new opportunities to enhance them with richer world knowledge and explicit reasoning capabilities. With explicit reasoning, a recommender can better infer users' underlying intents, adapt to evolving preferences, and leverage semantic relationships to improve accuracy and interpretability. However, existing reasoning-based recommendation methods often fail to fully align the LLM's reasoning process with recommendation-specific objectives, because of structural disruption during integration and the difficulty of translating free-form generation into accurate item predictions. In this paper, we introduce RPORec, a reinforced preference optimization framework that unifies the reasoning ability of an LLM backbone with a dedicated recommendation head (Rechead) for precise item retrieval. RPORec comprises two stages: (1) Reasoning-Augmented Recommendation Modeling, in which high-quality Chain-of-Thought (CoT) reasoning is generated and used as auxiliary knowledge to guide the Rechead in learning recommendation-specific representations; and (2) Advanced Reasoning Refinement and Alignment, in which the trained Rechead produces verifiable rewards used to fine-tune the LLM backbone via reinforcement learning, improving reasoning quality, structural consistency, and task relevance. Extensive experiments on public benchmarks and large-scale online deployments show that RPORec consistently outperforms state-of-the-art LLM-based recommendation methods, demonstrating the effectiveness of reasoning-augmented recommendation modeling in real-world systems.
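
To make the second stage more concrete, the following is a minimal sketch of how a trained recommendation head could turn sampled reasoning traces into a verifiable reward signal. The abstract does not specify the reward form or the reinforcement learning algorithm, so everything here is an assumption for illustration: the reciprocal-rank reward, the group-relative normalization, the function names, and the toy scores are all hypothetical and are not taken from RPORec itself.

```python
from statistics import mean, pstdev


def reciprocal_rank_reward(scores, target):
    """Assumed reward: 1 / rank of the ground-truth item under the head's scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 1.0 / (ranked.index(target) + 1)


def group_relative_advantages(rewards, eps=1e-8):
    """Assumed normalization: centre and scale rewards within one group of traces."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical head scores over three candidate items, one dict per
# reasoning trace sampled from the LLM backbone for the same user.
trace_scores = [
    {"item_a": 0.9, "item_b": 0.3, "item_c": 0.1},
    {"item_a": 0.2, "item_b": 0.7, "item_c": 0.4},
    {"item_a": 0.5, "item_b": 0.1, "item_c": 0.8},
]
target_item = "item_a"  # the item the user actually interacted with

rewards = [reciprocal_rank_reward(s, target_item) for s in trace_scores]
advantages = group_relative_advantages(rewards)

for i, (r, a) in enumerate(zip(rewards, advantages)):
    print(f"trace {i}: reward={r:.3f} advantage={a:+.3f}")
```

In this sketch a reasoning trace is rewarded only through its effect on where the head ranks the ground-truth item, which is what makes the reward checkable against logged interactions. A policy-gradient update on the backbone would then favor traces with positive advantage.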
