Large language models (LLMs) tackle complex tasks by generating long chains of thought, or "reasoning traces", that act as latent variables in the generation of an output given a query. A model's ability to generate such traces can be optimized with reinforcement learning (RL) to improve the traces' utility in predicting an answer. This optimization comes at a high computational cost, especially for narrative-related tasks that involve retrieving and processing many tokens. To reduce this cost, we propose LiteReason, a latent reasoning method that can be interleaved with standard token sampling and easily combined with RL techniques. LiteReason employs a lightweight Reasoning Projector module, trained to produce continuous latent tokens that help the model "skip" reasoning steps. During RL, the policy model decides when to activate the projector, switching between latent and discrete reasoning as needed. Experimental results on plot hole detection and book chapter generation show that our method outperforms latent reasoning baselines and comes close to matching non-latent RL training, while reducing final reasoning length by 77-92%. Overall, LiteReason guides RL training to a more efficient part of the performance-computation tradeoff curve.
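The interleaving of latent and discrete reasoning described in the abstract can be illustrated with a toy decoding loop. This is a minimal sketch under stated assumptions, not the authors' implementation: the `<latent>` control token, the linear form of the projector, the fixed `latents_per_call`, and the helper names `toy_embed` and `toy_step` are all hypothetical, and neither projector training nor the RL objective is shown.

```python
from typing import Callable, List, Sequence, Union

Vector = List[float]
LATENT = "<latent>"  # hypothetical control token; the abstract does not name one
EOS = "<eos>"        # hypothetical end-of-sequence token


def projector(hidden: Vector, weights: Sequence[Sequence[float]]) -> Vector:
    """Toy linear stand-in for a Reasoning Projector (assumed form):
    maps the current hidden state to one continuous latent token."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]


def toy_embed(token: str) -> Vector:
    """Hypothetical embedding: a 2-d vector derived from the token length."""
    return [float(len(token)), 1.0]


def toy_step(hidden: Vector, inp: Vector) -> Vector:
    """Hypothetical state update standing in for one forward pass."""
    return [0.5 * h + 0.5 * x for h, x in zip(hidden, inp)]


def interleaved_decode(
    policy: Callable[[Vector], str],
    embed: Callable[[str], Vector],
    step: Callable[[Vector, Vector], Vector],
    weights: Sequence[Sequence[float]],
    hidden: Vector,
    max_steps: int = 8,
    latents_per_call: int = 2,
) -> List[Union[str, Vector]]:
    """Alternate between discrete tokens and latent tokens, as chosen by the policy."""
    trace: List[Union[str, Vector]] = []
    for _ in range(max_steps):
        token = policy(hidden)
        if token == LATENT:
            # The policy activated the projector: feed continuous latent
            # tokens back into the model instead of sampled text.
            for _ in range(latents_per_call):
                z = projector(hidden, weights)
                trace.append(z)
                hidden = step(hidden, z)
        else:
            # Ordinary discrete step: record the token and feed its embedding.
            trace.append(token)
            hidden = step(hidden, embed(token))
            if token == EOS:
                break
    return trace


def demo() -> List[Union[str, Vector]]:
    """Run the loop with a scripted policy so the control flow is easy to follow."""
    decisions = iter(["so", LATENT, "answer", EOS])
    identity = [[1.0, 0.0], [0.0, 1.0]]
    return interleaved_decode(
        lambda hidden: next(decisions), toy_embed, toy_step, identity, [0.0, 0.0]
    )
```

In this sketch the returned trace mixes strings (discrete reasoning tokens) with float vectors (latent tokens), so the shortening of the visible reasoning corresponds to text steps being replaced by latent entries. In a real system the policy decision would come from the model's own sampling distribution rather than a scripted iterator.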