Reinforcement learning plays a crucial role in generative re-ranking because of its ability to balance exploration and exploitation. However, most existing generative methods fail to adapt to how difficulty fluctuates during list generation, as reflected in the model's changing entropy, which makes it hard to capture complex preferences accurately. Motivated by the remarkable breakthroughs language models have achieved by integrating reasoning capabilities, we introduce a latent reasoning mechanism, and our experiments show that it effectively reduces the entropy of the model's decisions. Building on this finding, we propose the Entropy-Guided Latent Reasoning (EGLR) recommendation model, which offers three core advantages. First, it abandons the "reason first, recommend later" paradigm in favor of "reasoning while recommending": reasoning happens in real time during generation, a design tailored to the high difficulty of list generation. Second, it implements entropy-guided variable-length reasoning using context-aware reasoning tokens together with dynamic temperature adjustment, broadening exploration during reasoning and sharpening exploitation during recommendation to achieve a better-adapted exploration-exploitation trade-off. Third, it adopts a lightweight integration design with no complex independent modules or post-processing, so it can be easily adapted to existing models. Experiments on two real-world datasets validate the model's effectiveness; notably, EGLR is compatible with existing generative re-ranking models and improves their performance. Further analyses demonstrate its practical deployment value and research potential.
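The entropy-guided, variable-length reasoning described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no formulas, so the entropy threshold, the step budget, the two temperatures, and the helper names (`generate_list`, `toy_score`, `toy_reason`) are all assumptions made for illustration. The sketch only shows the control flow: at each list position the decoder keeps taking latent reasoning steps while the entropy of its next-item distribution is high, then recommends an item at a lower temperature.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def generate_list(score_fn, reason_fn, state, num_items, list_len,
                  entropy_threshold=1.0, max_reason_steps=3,
                  reason_temp=1.5, recommend_temp=0.5, seed=0):
    """Interleave reasoning and recommending, gated by entropy.

    All hyperparameters here are hypothetical defaults, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    chosen, reason_counts = [], []
    for _ in range(list_len):
        candidates = [i for i in range(num_items) if i not in chosen]
        steps = 0
        while True:
            logits = score_fn(state, candidates)
            uncertain = entropy(softmax(logits)) > entropy_threshold
            if not uncertain or steps >= max_reason_steps:
                break
            # Assumed reasoning step: update the latent state from a
            # high-temperature (exploratory) view of the distribution.
            state = reason_fn(state, softmax(logits, reason_temp))
            steps += 1
        # Recommend with a low temperature (sharper, more exploitative).
        probs = softmax(logits, recommend_temp)
        item = rng.choices(candidates, weights=probs, k=1)[0]
        chosen.append(item)
        reason_counts.append(steps)
    return chosen, reason_counts


# Toy stand-ins for a real model: the "state" is a scalar sharpness
# factor, and each reasoning step doubles it, which lowers entropy.
BASE_SCORES = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5]


def toy_score(state, candidates):
    return [state * BASE_SCORES[i] for i in candidates]


def toy_reason(state, soft_probs):
    return state * 2.0


items, counts = generate_list(toy_score, toy_reason, state=0.5,
                              num_items=6, list_len=3)
print(items, counts)
```

In this toy setup the number of reasoning steps varies per position because it depends only on how uncertain the current distribution is, which is the variable-length behaviour the abstract refers to. Carrying the latent state across list positions is also an assumption of the sketch.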