Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs), and its adoption for enhancing recommendation quality is growing rapidly. In this work, we critically examine this trend and argue that Long CoT is inherently ill-suited to the sequential recommendation domain. We attribute this misalignment to two primary factors: excessive inference latency and the absence of explicit cognitive reasoning patterns in user behavioral data. Motivated by these observations, we propose pivoting away from the CoT structure and directly leveraging its underlying mechanism, Reinforcement Learning (RL), to explore the item space. However, applying RL directly faces significant obstacles, notably low sample efficiency (most actions fail to provide learning signals) and training instability. To overcome these limitations, we propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation. RISER transforms non-learnable trajectories into effective pairwise preference data for optimization. Furthermore, it incorporates dedicated strategies to ensure stability, including the prevention of redundant rollouts and the constraint of token-level update magnitudes. Extensive experiments on three real-world datasets show that RISER significantly outperforms competitive baselines, establishing a robust paradigm for RL-enhanced LLM recommendation. Our code will be available at https://anonymous.4open.science/r/RISER/.