Long chain-of-thought (Long CoT) reasoning is now fundamental to state-of-the-art LLMs, especially in mathematical reasoning. However, LLM generation is highly sequential, and long CoTs incur high latency. We propose to train Divide-and-Conquer CoT (DC-CoT) to reduce this latency. With DC-CoT, the model acts as a director that identifies distinct subtasks in its reasoning process that can be performed in parallel, and then spawns workers to execute them. Our goal is to achieve high accuracy with a low longest path length, a theoretical measure of the latency of a response. We start from a long CoT base model (DeepScaleR-1.5B-Preview) and first use SFT on a small curated demonstration set to initialize its ability to spawn workers in a prescribed format. Because SFT degrades accuracy significantly, we design a multi-stage RL algorithm with various data filtering strategies to recover accuracy while decreasing the longest path length. Across several benchmarks, including AIME 2024 and HMMT 2025, DC-CoT matches the accuracy of DeepScaleR-1.5B-Preview while reducing the longest path length by 35-40%. Our code, SFT dataset, and models are publicly available at https://github.com/amahankali10/DC_CoT_RL_for_Low_Latency_CoT_with_Parallel_Reasoning.
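As a minimal sketch of the latency measure above: when a director spawns a group of parallel workers, only the slowest worker in that group contributes to the critical path, whereas a purely sequential CoT pays for every token. The function below is a hypothetical illustration (the trace structure and token counts are invented, not taken from the paper's implementation):

```python
# Hypothetical illustration of the "longest path length" latency measure
# for a director/worker reasoning trace. Assumes latency is proportional
# to tokens generated on the critical path.

def longest_path_length(director_tokens, worker_groups):
    """director_tokens: tokens the director generates sequentially.
    worker_groups: list of lists; each inner list holds the token counts
    of workers spawned in parallel at one point, so only the longest
    worker in each group adds to the critical path."""
    total = director_tokens
    for group in worker_groups:
        total += max(group) if group else 0
    return total

groups = [[300, 250], [400, 380, 390]]
# Sequential baseline: every token lies on one path.
sequential = 1000 + sum(sum(g) for g in groups)            # 2720
# Parallel trace: each group costs only its slowest worker.
parallel = longest_path_length(1000, groups)               # 1700
```

In this toy trace the parallel critical path is about 37% shorter than the sequential one, illustrating how a 35-40% reduction can coexist with the same total reasoning content.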