FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

Yan Wang, Keyi Wang, Shanshan Yang, Jaisal Patel, Jeff Zhao, Fengran Mo, Xueqing Peng, Lingfei Qian, Yankai Chen, Víctor Gutiérrez-Basulto, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Xue Liu, Jian-Yun Nie

From arXiv. Accepted to the SIGIR 2026 Resource Track (pre-camera-ready version).

Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. As financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. The task currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.
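
To make the notion of a numerical inconsistency under taxonomy-defined relations more concrete, the following is a minimal sketch of a summation check over reported facts. It is an illustration only and not the benchmark's evaluation code: the `CalcRelation` class, the `find_calc_inconsistencies` function, the tolerance value, and all concept names and amounts are hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CalcRelation:
    """Hypothetical calculation relation: parent = sum(weight * child)."""
    parent: str
    children: dict[str, float]  # child concept -> weight (e.g. +1.0 or -1.0)


def find_calc_inconsistencies(facts, relations, tolerance=0.5):
    """Return (parent, reported, expected) for every relation that does not add up.

    `facts` maps concept names to reported values. Relations with any
    missing fact are skipped rather than flagged (an assumption of this sketch).
    """
    issues = []
    for rel in relations:
        needed = [rel.parent, *rel.children]
        if any(concept not in facts for concept in needed):
            continue
        expected = sum(weight * facts[child] for child, weight in rel.children.items())
        if abs(facts[rel.parent] - expected) > tolerance:
            issues.append((rel.parent, facts[rel.parent], expected))
    return issues


# Made-up values: total assets do not equal the sum of their components.
facts = {
    "Assets": 1000.0,
    "AssetsCurrent": 400.0,
    "AssetsNoncurrent": 500.0,
    "GrossProfit": 300.0,
    "Revenues": 800.0,
    "CostOfRevenue": 500.0,
}
relations = [
    CalcRelation("Assets", {"AssetsCurrent": 1.0, "AssetsNoncurrent": 1.0}),
    CalcRelation("GrossProfit", {"Revenues": 1.0, "CostOfRevenue": -1.0}),
]

issues = find_calc_inconsistencies(facts, relations)
for parent, reported, expected in issues:
    print(f"{parent}: reported {reported}, expected {expected}")
```

In this toy case only the `Assets` relation is flagged, since 400 + 500 differs from the reported 1000, while `GrossProfit` is consistent. Real filings add contexts, units, and periods to each fact, which this sketch deliberately omits.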
