The integration of reinforcement learning (RL) into large language models (LLMs) has opened new opportunities for recommender systems by eliciting reasoning and improving user preference modeling. However, RL-based LLM recommendation faces significant efficiency challenges, making full-data training costly. Existing data selection methods define sample value based on learnability or representativeness, yet their loss-, gradient-, or coverage-driven criteria often misalign with RL learning dynamics, resulting in suboptimal performance. To address this, we propose MiniRec, a data selection framework tailored for RL-based LLM recommendation. MiniRec evaluates sample learnability using a key RL signal -- the reward -- pruning samples that are too easy (consistently high reward) or too difficult (consistently low reward). It assesses representativeness by aligning sample gradients with an approximation of the "ideal" global RL optimization trajectory, selecting the samples that chiefly drive model updates, and it further enforces diversity to reduce redundancy. Combined with an easy-to-hard curriculum learning strategy, MiniRec significantly reduces training cost while largely preserving performance. Extensive experiments demonstrate MiniRec's effectiveness, highlighting the importance of reward-aligned, trajectory-informed data selection in RL-based LLM recommendation.
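The selection pipeline described above can be sketched in code. This is a minimal illustration, not the paper's implementation: the function name, thresholds, the use of the mean kept gradient as the approximated global trajectory, and the cosine-based diversity filter are all assumptions introduced here for concreteness.

```python
import numpy as np

def select_samples(rewards, grads, low=0.1, high=0.9, k=2, div_thresh=0.95):
    """Hypothetical sketch of MiniRec-style data selection.

    rewards: (n,) mean rollout reward per training sample.
    grads:   (n, d) per-sample gradient estimates.
    Returns indices of selected samples, ordered easy-to-hard.
    """
    rewards = np.asarray(rewards, dtype=float)
    grads = np.asarray(grads, dtype=float)

    # 1) Learnability: prune samples whose reward is too high (too easy)
    #    or too low (too hard).
    idx = np.flatnonzero((rewards > low) & (rewards < high))

    # 2) Representativeness: score each kept sample by cosine similarity
    #    between its gradient and an approximated global optimization
    #    direction (assumed here: the mean of the kept gradients).
    global_dir = grads[idx].mean(axis=0)
    global_dir /= np.linalg.norm(global_dir) + 1e-8
    norms = np.linalg.norm(grads[idx], axis=1) + 1e-8
    scores = grads[idx] @ global_dir / norms

    # 3) Diversity: greedily take top-scoring samples, skipping any whose
    #    gradient is nearly parallel to an already selected one.
    order = idx[np.argsort(-scores)]
    selected = []
    for i in order:
        g = grads[i] / (np.linalg.norm(grads[i]) + 1e-8)
        if all(g @ (grads[j] / (np.linalg.norm(grads[j]) + 1e-8)) < div_thresh
               for j in selected):
            selected.append(i)
        if len(selected) == k:
            break

    # 4) Curriculum: order the selected samples from easy (higher reward)
    #    to hard (lower reward) for training.
    selected.sort(key=lambda i: -rewards[i])
    return selected
```

With toy rewards and 2-D gradients, the sketch prunes the extreme-reward samples, drops a near-duplicate gradient, and returns the rest in easy-to-hard order.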