EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements

Large Language Models (LLMs) have made remarkable progress, surpassing human performance on several benchmarks in domains such as mathematics and coding. A key driver of this progress has been the development of benchmark datasets. The financial domain, in contrast, poses higher entry barriers because it demands specialized expertise, and its benchmarks remain scarce relative to those in mathematics or coding. We introduce EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate LLMs on challenging tasks such as accounting fraud detection, earnings forecasting, and industry classification. EDINET-Bench is constructed from ten years of annual reports filed by Japanese companies. Its tasks require models to process entire annual reports and integrate information across multiple tables and textual sections, demanding expert-level reasoning that is challenging even for human professionals. Our experiments show that even state-of-the-art LLMs struggle in this domain, performing only marginally better than logistic regression on binary classification tasks such as fraud detection and earnings forecasting. Simply providing reports to LLMs in a straightforward setting is therefore not enough. This result highlights the need for benchmark frameworks that better reflect the environments in which financial professionals operate, with richer scaffolding such as realistic simulations and task-specific reasoning support to enable more effective problem solving. We make our dataset and code publicly available to support future research.
