Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to more than 10,000 US GAAP concepts remains costly and error-prone. Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring both the hierarchical semantics of the taxonomy and the structured nature of financial documents. Consequently, they fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure-aware, full-scope XBRL tagging. We decompose the tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts numeric entities and their types from heterogeneous contexts, including text and tables; and (2) FinCL (Financial Concept Linking), which maps the extracted entities to the full US GAAP taxonomy. This two-stage formulation enables a fair assessment of LLMs' capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero-shot settings reveals that, while models generalize well in extraction, they struggle significantly with fine-grained concept linking, highlighting critical limitations in domain-specific, structure-aware reasoning.