Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to over ten thousand US-GAAP concepts remains costly and error-prone. Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents. As a result, these benchmarks fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure-aware and full-scope XBRL tagging. We decompose the complex tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts entities and types from heterogeneous contexts such as text and tables; and (2) FinCL (Financial Concept Linking), which maps extracted entities to the full US-GAAP taxonomy. This two-stage formulation enables a fair assessment of LLM capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero-shot settings shows that while models generalize well in extraction, they struggle with fine-grained concept linking, revealing important limitations in domain-specific, structure-aware reasoning. Code is available on GitHub, and datasets are available on Hugging Face.
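The two-stage formulation above can be sketched in miniature. The function names, the toy two-concept taxonomy, and the token-overlap scoring below are all illustrative assumptions standing in for the benchmark's actual extraction and retrieval components, not the paper's implementation:

```python
# Hedged sketch of the two-stage FinTagging formulation: FinNI extracts
# numeric entities with coarse types, then FinCL links each entity to a
# taxonomy concept. The taxonomy and heuristics here are toy stand-ins.
import re

def finni_extract(sentence):
    """Stage 1 (FinNI): return (value, type) pairs for numeric facts."""
    entities = []
    for match in re.finditer(r"\$?[\d,]+(?:\.\d+)?(?: million| billion)?", sentence):
        value = match.group(0).strip()
        kind = "monetary" if value.startswith("$") else "other"
        entities.append((value, kind))
    return entities

# A two-concept stand-in for the full US-GAAP taxonomy.
TOY_TAXONOMY = {
    "us-gaap:Revenues": {"revenue", "revenues", "sales"},
    "us-gaap:NetIncomeLoss": {"net", "income", "loss"},
}

def fincl_link(sentence, entity):
    """Stage 2 (FinCL): pick the concept whose label tokens best overlap
    the sentence context (a proxy for taxonomy alignment)."""
    tokens = set(sentence.lower().replace(",", "").split())
    return max(TOY_TAXONOMY, key=lambda c: len(TOY_TAXONOMY[c] & tokens))

sentence = "Total revenues were $394.3 million in fiscal 2023."
for value, kind in finni_extract(sentence):
    concept = fincl_link(sentence, value)
    print(value, kind, concept)
```

Decomposing tagging this way lets the extraction stage be evaluated independently of the much harder alignment stage, which in the real benchmark must discriminate among thousands of US-GAAP concepts rather than two.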