We propose Rec-R1, a general reinforcement learning framework that bridges large language models (LLMs) and recommendation systems through closed-loop optimization. Unlike prompting and supervised fine-tuning (SFT), Rec-R1 directly optimizes LLM generation using feedback from a fixed black-box recommendation model, without relying on synthetic SFT data distilled from proprietary models such as GPT-4o, thereby avoiding the substantial cost and effort of data distillation. To verify the effectiveness of Rec-R1, we evaluate it on two representative tasks: product search and sequential recommendation. Experimental results demonstrate that Rec-R1 not only consistently outperforms prompting- and SFT-based methods, but also achieves significant gains over strong discriminative baselines, even when paired with simple retrievers such as BM25. Moreover, Rec-R1 preserves the general-purpose capabilities of the LLM, unlike SFT, which often impairs instruction-following and reasoning. These findings position Rec-R1 as a promising foundation for continual task-specific adaptation without catastrophic forgetting.
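To make the closed-loop idea concrete, below is a minimal Python sketch of how a reward signal could be derived from a fixed black-box retriever and fed back to the LLM policy. Everything here is an illustrative assumption rather than the authors' released implementation: the toy corpus, the lexical-overlap scorer standing in for BM25, and the placeholder names (sampled_query, the policy-update comment) are hypothetical.

```python
# Hedged sketch of the Rec-R1-style reward loop: the retriever stays fixed
# and only the LLM policy that produces the query would be updated by RL.
import math
from typing import List

def retrieve(query: str, corpus: List[str], k: int = 5) -> List[int]:
    """Rank corpus items by token overlap with the query (a stand-in for BM25)."""
    q_tokens = set(query.lower().split())
    scores = [len(q_tokens & set(doc.lower().split())) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:k]

def ndcg_at_k(ranked_ids: List[int], relevant_id: int, k: int = 5) -> float:
    """Binary-relevance NDCG@k: discounted gain if the target item is retrieved."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id == relevant_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

def reward(query: str, corpus: List[str], relevant_id: int) -> float:
    """Reward for one LLM-generated query: ranking quality of the fixed retriever."""
    return ndcg_at_k(retrieve(query, corpus), relevant_id)

# In the full loop, an RL algorithm (e.g., PPO) would maximize this reward over
# queries sampled from the LLM; the query below is a hard-coded placeholder.
corpus = [
    "wireless noise cancelling over-ear headphones",
    "stainless steel insulated water bottle",
    "lightweight running shoes for men",
]
sampled_query = "bluetooth over-ear headphones with noise cancelling"  # would come from the LLM policy
print(reward(sampled_query, corpus, relevant_id=0))  # higher is better
```

The key design point the sketch tries to convey is that the recommendation model is treated purely as a reward function: no gradients flow through it, so any retriever (BM25 or a learned ranker) can be plugged in unchanged.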