BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Chatbots based on large language models (LLMs) are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. Existing benchmarks rarely capture such errors. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized into three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6 percentage points (basic), 75.1 percentage points (intermediate), and 62.9 percentage points (advanced) over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.
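To make the kind of computation concrete, the following is a minimal illustrative sketch of two payout calculations that involve exponents and a geometric progression. It is not taken from BankMathBench: the function names, the monthly-compounding convention, the start-of-month deposit timing, and the omission of taxes and fees are all assumptions made for illustration, and the benchmark's own formulas may differ.

```python
def deposit_payout(principal: float, annual_rate: float, months: int) -> float:
    """Maturity value of a lump-sum deposit.

    Assumption: interest compounds monthly at annual_rate / 12.
    """
    monthly_rate = annual_rate / 12
    return principal * (1 + monthly_rate) ** months


def installment_savings_payout(monthly_deposit: float, annual_rate: float, months: int) -> float:
    """Maturity value of an installment savings product.

    Assumptions: one deposit at the start of each month, monthly compounding.
    The deposit made in month k earns interest for (months - k + 1) periods,
    so the total is a geometric series with ratio (1 + r).
    """
    r = annual_rate / 12
    if r == 0:
        return monthly_deposit * months
    return monthly_deposit * (1 + r) * ((1 + r) ** months - 1) / r


def better_product(payout_a: float, payout_b: float) -> str:
    """Multi-product comparison reduced to its last step: pick the larger payout."""
    return "A" if payout_a >= payout_b else "B"


if __name__ == "__main__":
    # Hypothetical products: 12,000 deposited up front at 3.0% versus
    # 1,000 per month for 12 months at 4.5%.
    lump_sum = deposit_payout(12_000, 0.030, 12)
    installments = installment_savings_payout(1_000, 0.045, 12)
    print(round(lump_sum, 2), round(installments, 2), better_product(lump_sum, installments))
```

Even in this simplified form, a correct answer depends on identifying the product type before choosing a formula, which is the step the abstract describes models as getting wrong.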
