While large language models (LLMs) have shown strong general reasoning capabilities, their effectiveness in financial reasoning, which is crucial for real-world financial applications, remains underexplored. In this study, we conduct a comprehensive evaluation of 24 state-of-the-art general and reasoning-focused LLMs across four complex financial reasoning tasks involving financial text, tabular data, and equations. We assess key capabilities such as numerical reasoning, tabular interpretation, financial terminology comprehension, long-context understanding, and equation-based problem solving. Our analysis reveals that while data quality and pretraining contribute to performance, general techniques like chain-of-thought (CoT) fine-tuning offer limited gains in financial tasks. To address this, we propose two domain-adapted models, Fino1-8B and Fino1-14B, trained with CoT fine-tuning and reinforcement learning using domain-specific reasoning paths. Our models are trained on a carefully curated dataset integrating high-quality examples from diverse sources, covering financial reports, tables, equations, and structured XBRL texts. Despite limited training data, they achieve a 7-9% performance improvement, outperforming several advanced LLMs, including GPT-o1, GPT-o3-mini, and GPT-4.5, and performing comparably to DeepSeek models (V3 and R1), demonstrating strong practical value in resource-constrained scenarios. Our findings highlight the need for domain-specific adaptations in financial reasoning, and we release all datasets, models, and code for future research.