Recent studies show that the reasoning capabilities of Large Language Models (LLMs) can be improved by applying Reinforcement Learning (RL) to question-answering (QA) tasks in areas such as math and coding. With a long context length, LLMs may learn to perform search, as indicated by the self-correction behavior observed in DeepSeek R1. However, this search behavior is often imprecise and lacks confidence, resulting in long, redundant responses and highlighting deficiencies in intuition and verification. Inspired by the Dual Process Theory in psychology, we introduce a simple modification to the QA task that includes four stages: Fast Thinking, where the LLM must answer within a strict token budget; Verification, where the model evaluates its initial response; Slow Thinking, where it refines the initial response with more deliberation; and Summarization, where it distills the refinement from the previous stage into precise steps. Our proposed task improves average accuracy from 25.6% to 27.3% for Qwen2.5-1.5B, and from 45.9% to 51.0% for DeepSeek-R1-Qwen-1.5B. Notably, for Qwen2.5-1.5B, the Fast Thinking mode alone achieves 25.2% accuracy using fewer than 1000 tokens, demonstrating substantial inference efficiency gains. These findings suggest that intuition and deliberative reasoning are distinct, complementary systems benefiting from targeted training. Additionally, we have open-sourced both the trained models and the source code.
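To make the four-stage format concrete, the sketch below shows one plausible way to chain the stages around a generic `generate(prompt, max_new_tokens)` call. The prompt wording, the `FAST_BUDGET`/`SLOW_BUDGET` values, and the helper name `four_stage_answer` are illustrative assumptions, not the paper's exact templates or training setup.

```python
from typing import Callable

# Minimal sketch of the four-stage QA format described above.
# `generate(prompt, max_new_tokens)` stands in for any LLM call; the budgets
# and prompt texts are assumptions for illustration only.

FAST_BUDGET = 1000   # strict token budget for Fast Thinking (per the abstract)
SLOW_BUDGET = 8000   # larger budget for deliberate refinement (assumed)


def four_stage_answer(question: str,
                      generate: Callable[[str, int], str]) -> dict:
    """Run Fast Thinking -> Verification -> Slow Thinking -> Summarization."""
    # Stage 1: Fast Thinking -- answer intuitively within a strict token budget.
    fast = generate(
        f"Question: {question}\nAnswer concisely within the token budget.",
        FAST_BUDGET,
    )

    # Stage 2: Verification -- the model evaluates its own initial response.
    verdict = generate(
        f"Question: {question}\nInitial answer: {fast}\n"
        "Verify this answer. State whether it is correct and why.",
        FAST_BUDGET,
    )

    # Stage 3: Slow Thinking -- refine the initial response with more deliberation.
    slow = generate(
        f"Question: {question}\nInitial answer: {fast}\nVerification: {verdict}\n"
        "Re-solve the problem carefully, correcting any mistakes found above.",
        SLOW_BUDGET,
    )

    # Stage 4: Summarization -- distill the refinement into precise steps.
    summary = generate(
        f"Question: {question}\nDetailed solution: {slow}\n"
        "Summarize the solution as a short sequence of precise steps "
        "ending with the final answer.",
        FAST_BUDGET,
    )

    return {"fast": fast, "verification": verdict,
            "slow": slow, "summary": summary}
```

In this sketch each stage conditions on the outputs of the previous ones, mirroring how the proposed task first elicits a budgeted intuitive answer and only then spends tokens on verification and refinement.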