Effective reasoning remains a core challenge for large language models (LLMs) in the financial domain, where tasks often require domain-specific knowledge, precise numerical calculations, and strict adherence to compliance rules. We propose DianJin-R1, a reasoning-enhanced framework designed to address these challenges through reasoning-augmented supervision and reinforcement learning. Central to our approach is DianJin-R1-Data, a high-quality dataset constructed from CFLUE, FinQA, and a proprietary compliance corpus (Chinese Compliance Check, CCC), combining diverse financial reasoning scenarios with verified annotations. Our models, DianJin-R1-7B and DianJin-R1-32B, are fine-tuned from Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct, respectively, using a structured format that generates both reasoning steps and final answers. To further refine reasoning quality, we apply Group Relative Policy Optimization (GRPO), a reinforcement learning method that incorporates dual reward signals: one encouraging structured outputs and another rewarding answer correctness. We evaluate our models on five benchmarks: three financial datasets (CFLUE, FinQA, and CCC) and two general reasoning benchmarks (MATH-500 and GPQA-Diamond). Experimental results show that DianJin-R1 models consistently outperform their non-reasoning counterparts, especially on complex financial tasks. Moreover, on the real-world CCC dataset, our single-call reasoning models match or even surpass the performance of multi-agent systems that incur significantly higher computational cost. These findings demonstrate the effectiveness of DianJin-R1 in enhancing financial reasoning through structured supervision and reward-aligned learning, offering a scalable and practical solution for real-world applications.
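For concreteness, the following is a minimal sketch of the dual reward signal described above, written in Python. The `<think>`/`<answer>` tag names, the exact-match correctness check, and the equal weighting of the two signals are illustrative assumptions; the abstract only states that one signal encourages structured outputs and the other rewards answer correctness.

```python
import re

# Assumed output layout: reasoning inside <think>...</think>, followed by the
# final answer inside <answer>...</answer>. These tags are a hypothetical
# choice for illustration, not confirmed by the abstract.
THINK_ANSWER = re.compile(r"<think>.+?</think>\s*<answer>(.+?)</answer>", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed reasoning-then-answer layout."""
    return 1.0 if THINK_ANSWER.search(completion) else 0.0

def correctness_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the extracted final answer matches the reference (exact match here)."""
    match = THINK_ANSWER.search(completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == gold_answer.strip() else 0.0

def dual_reward(completion: str, gold_answer: str) -> float:
    """Scalar reward for a sampled completion: format signal plus correctness
    signal, equally weighted (an assumption)."""
    return format_reward(completion) + correctness_reward(completion, gold_answer)

# Usage: score one sampled completion against its reference answer.
sample = "<think>Net profit = 120 - 80 = 40.</think><answer>40</answer>"
print(dual_reward(sample, "40"))  # -> 2.0
```

In a GRPO-style setup, such a scalar reward would be computed for each completion in a sampled group, with advantages derived from the rewards' deviation from the group mean; the sketch covers only the reward computation itself.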