Large Language Models (LLMs) have demonstrated strong reasoning capabilities through Chain-of-Thought (CoT) across a variety of tasks, yet the inefficiency of token-by-token generation hinders their deployment in latency-sensitive recommender systems. Latent reasoning has emerged as an effective paradigm for LLMs: it performs multi-step inference in a continuous hidden-state space, achieving stronger reasoning at lower cost. However, this paradigm remains underexplored in mainstream generative recommendation. Adapting it reveals three challenges: (1) a gap between prior-less Semantic ID (SID) symbols and continuous latent reasoning, since SIDs lack pre-trained semantics and this hinders joint optimization; (2) representation drift caused by the lack of supervision over the reasoning chain; and (3) the suboptimality of a globally fixed reasoning depth. To address these challenges, we propose LASAR (Latent Adaptive Semantic Aligned Reasoning), an SFT-then-RL framework. First, we bridge the gap between SIDs and latent reasoning with two-stage training: Stage 1 grounds SID semantics before Stage 2 introduces latent reasoning, ensuring efficient convergence. Second, we mitigate representation drift through explicit CoT semantic alignment: a step-wise bidirectional KL divergence constrains the latent reasoning trajectory using hidden-state anchors extracted from CoT text, while a Policy Head predicts the per-sample reasoning depth. Third, during the GRPO-based RL phase, terminal-only KL alignment accommodates variable-length reasoning, and REINFORCE optimizes the Policy Head to allocate steps dynamically. This adaptive allocation nearly halves the average number of latent steps while improving recommendation quality. Experiments on three real-world datasets show that LASAR outperforms all baselines, adds only marginal inference latency, and is roughly 20 times faster than generating explicit CoT text.
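To make the alignment terms concrete, the following is a minimal sketch of a step-wise and a terminal-only bidirectional KL loss. It is an illustration under stated assumptions, not the LASAR implementation: the abstract does not say how hidden states are turned into distributions, so the softmax over each state vector, the equal weighting of the two KL directions, the averaging over steps, and all function names here are hypothetical.

```python
import math


def softmax(xs):
    """Turn a vector into a probability distribution (numerically stable)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def bidirectional_kl(p, q):
    """Symmetric form KL(p || q) + KL(q || p); equal weighting is an assumption."""
    return kl(p, q) + kl(q, p)


def stepwise_alignment_loss(latent_states, anchor_states):
    """Average bidirectional KL between each latent step and its CoT anchor.

    Assumes one anchor per latent step and a softmax over each state vector.
    """
    if len(latent_states) != len(anchor_states):
        raise ValueError("step-wise alignment needs one anchor per latent step")
    per_step = [
        bidirectional_kl(softmax(h), softmax(a))
        for h, a in zip(latent_states, anchor_states)
    ]
    return sum(per_step) / len(per_step)


def terminal_alignment_loss(latent_states, final_anchor):
    """Align only the last latent step, so the step count may vary per sample."""
    return bidirectional_kl(softmax(latent_states[-1]), softmax(final_anchor))


if __name__ == "__main__":
    latent = [[0.2, 1.0, -0.5], [1.5, 0.1, 0.3]]
    anchors = [[0.0, 1.2, -0.4], [1.4, 0.0, 0.5]]
    print("step-wise loss:", stepwise_alignment_loss(latent, anchors))
    print("terminal-only loss:", terminal_alignment_loss(latent, anchors[-1]))
```

The terminal-only variant reads just the final latent state, which is why it tolerates trajectories of different lengths, whereas the step-wise variant requires a one-to-one pairing between latent steps and anchors.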