Reasoning Large Language Models (RLLMs) have recently achieved remarkable progress on complex reasoning tasks, largely enabled by their long chain-of-thought (Long CoT) capabilities. However, developing these Long CoT behaviors relies heavily on post-training with high-quality, typically costly, human-curated datasets (e.g., mathematics and code), leaving scalable alternatives underexplored. In this work, we introduce NP-hard (NPH) graph problems as a novel synthetic training corpus, since they inherently require deep reasoning, extensive exploration, and reflective strategies, the core characteristics of Long CoT reasoning. Building on this insight, we develop a two-stage post-training framework: (i) Long CoT Supervised Fine-Tuning (SFT) on rejection-sampled NPH graph instances, which substantially enhances reasoning depth, and (ii) Reinforcement Learning (RL) with a fine-grained reward design, which sharpens reasoning efficiency. The resulting NPG-Muse-series models exhibit substantially enhanced Long CoT reasoning capabilities, achieving consistent gains across mathematics, coding, logical, and graph reasoning benchmarks. NPG-Muse-7B even surpasses QwQ-32B on NPH graph problems in both accuracy and reasoning efficiency. These results position NPH graph problems as an effective and scalable resource for advancing Long CoT reasoning in LLM post-training. Our implementation is available at https://github.com/littlewyy/NPG-Muse.