Answering questions within business and finance requires reasoning, precision, and a wide breadth of technical knowledge. Together, these requirements make the domain difficult for large language models (LLMs). We introduce BizBench, a benchmark for evaluating models' ability to reason about realistic financial problems. BizBench comprises eight quantitative reasoning tasks, focusing on question answering (QA) over financial data via program synthesis. We include three financially themed code-generation tasks built from newly collected and augmented QA data. Additionally, we isolate the reasoning capabilities required for financial QA: reading comprehension of financial text and tables for extracting intermediate values, and understanding of the financial concepts and formulas needed to calculate complex solutions. Collectively, these tasks evaluate a model's financial background knowledge, ability to parse financial documents, and capacity to solve problems with code. We conduct an in-depth evaluation of open-source and commercial LLMs, comparing and contrasting the behavior of code-focused and language-focused models. We demonstrate that the current bottleneck in performance is LLMs' limited business and financial understanding, highlighting the value of a challenging benchmark for quantitative reasoning within this domain.