We present PCL-Reasoner-V1.5, a 32-billion-parameter large language model (LLM) for mathematical reasoning. The model is built upon Qwen2.5-32B and refined via supervised fine-tuning (SFT) followed by reinforcement learning (RL). A central innovation is our proposed offline RL method, which provides superior training stability and efficiency over standard online RL methods such as GRPO. Our model achieves state-of-the-art performance among models post-trained on Qwen2.5-32B, attaining average accuracies of 90.9% on AIME 2024 and 85.6% on AIME 2025. Our work demonstrates offline RL as a stable and efficient paradigm for advancing reasoning in LLMs. All experiments were conducted on Huawei Ascend 910C NPUs.