Large language models (LLMs) are increasingly applied to financial analysis, yet their ability to audit structured financial statements under explicit accounting principles remains underexplored. Existing benchmarks primarily evaluate question answering, numerical reasoning, or anomaly detection on synthetically corrupted data, leaving it unclear whether models can reliably verify or localize rule compliance on genuine financial statements. We introduce FinRule-Bench, a benchmark for evaluating diagnostic completeness in rule-based financial reasoning over real-world financial tables. FinRule-Bench pairs ground-truth financial statements with explicit, human-curated accounting principles and spans four canonical statement types: Balance Sheets, Cash Flow Statements, Income Statements, and Statements of Equity. The benchmark defines three auditing tasks that require progressively stronger reasoning capabilities: (i) rule verification, which tests compliance with a single principle; (ii) rule identification, which requires selecting the violated principle from a provided rule set; and (iii) joint rule diagnosis, which requires detecting and localizing multiple simultaneous violations at the record level. We evaluate LLMs under zero-shot and few-shot prompting, and introduce a causal-counterfactual reasoning protocol that enforces consistency between decisions, explanations, and counterfactual judgments. Across tasks and statement types, we find that while models perform well on isolated rule verification, performance degrades sharply for rule discrimination and multi-violation diagnosis. FinRule-Bench provides a principled and reproducible testbed for studying rule-governed reasoning, diagnostic coverage, and failure modes of LLMs in high-stakes financial analysis.