Going beyond simple text processing, financial auditing requires detecting semantic, structural, and numerical inconsistencies across large-scale disclosures. Because financial reports are filed in XBRL, a structured XML format governed by accounting standards, auditing becomes a structured information extraction and reasoning problem involving concept alignment, taxonomy-defined relations, and cross-document consistency. Although large language models (LLMs) show promise on isolated financial tasks, their capability in professional-grade auditing remains unclear. We introduce FinAuditing, a taxonomy-aligned, structure-aware benchmark built from real XBRL filings. It contains 1,102 annotated instances averaging over 33k tokens each and defines three tasks: Financial Semantic Matching (FinSM), Financial Relationship Extraction (FinRE), and Financial Mathematical Reasoning (FinMR). Evaluations of 13 state-of-the-art LLMs reveal substantial gaps in concept retrieval, taxonomy-aware relation modeling, and consistent cross-document reasoning. These findings highlight the need for realistic, structure-aware benchmarks. We release the evaluation code at https://github.com/The-FinAI/FinAuditing and the dataset at https://huggingface.co/collections/TheFinAI/finauditing. FinAuditing currently serves as the official benchmark of an ongoing public evaluation contest at https://open-finance-lab.github.io/SecureFinAI_Contest_2026/.