Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items and violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To address these issues, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful blackbox LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit of optimization, rather than the token (too fine-grained) or the full sequence (too coarse). It redefines rewards to remove non-causal credit assignment and introduces a rank-level importance ratio, based on the geometric mean of rank-wise token probabilities, to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
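
As a rough illustration of the rank-level importance ratio described above, the following minimal sketch computes, for each rank, the geometric mean of the token-level probability ratios between a new and an old policy. It is not taken from the released code: the function name `rank_importance_ratios`, the flat log-probability lists, and the `rank_spans` layout (half-open token index ranges, one per recommended item) are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def rank_importance_ratios(
    new_logps: Sequence[float],
    old_logps: Sequence[float],
    rank_spans: Sequence[Tuple[int, int]],
) -> List[float]:
    """Return one importance ratio per rank (hypothetical helper).

    new_logps / old_logps: per-token log-probabilities of the same generated
        list under the current and the old policy (assumed layout).
    rank_spans: half-open (start, end) token index ranges, one per rank
        (assumed layout; how ranks are delimited is not specified here).
    """
    if len(new_logps) != len(old_logps):
        raise ValueError("log-probability sequences must have equal length")

    ratios = []
    for start, end in rank_spans:
        n_tokens = end - start
        if n_tokens <= 0:
            raise ValueError("each rank must cover at least one token")
        # The geometric mean of token ratios equals exp(mean of log-ratio),
        # so ranks with many tokens are not penalized by their length.
        mean_log_ratio = sum(
            new_logps[i] - old_logps[i] for i in range(start, end)
        ) / n_tokens
        ratios.append(math.exp(mean_log_ratio))
    return ratios


if __name__ == "__main__":
    # Two ranks: the first spans two tokens, the second spans one token.
    new = [math.log(0.5), math.log(0.5), math.log(0.2)]
    old = [math.log(0.25), math.log(0.25), math.log(0.4)]
    print(rank_importance_ratios(new, old, [(0, 2), (2, 3)]))
```

Averaging in log space keeps the ratio on a comparable scale for short and long item titles, which is one plausible reading of why a geometric mean would help stabilize the update; the exact clipping and reward definitions are left to the paper and repository.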

