Existing financial-NLP benchmarks mostly evaluate prepared artifacts such as filings, tables, or extracted values. Real accounting begins earlier: source documents must be reconciled into cited journal entries, aggregated into a balance sheet, and checked for contradictions. We introduce FinBalance, a multi-document accounting reconciliation benchmark built from source-document bundles across eight industries, three period types, and five difficulty levels. Human-authored business scenarios, accounting policies, tax/FX treatments, document schemas, distractors, and inconsistency templates are composed by a deterministic generator whose ledger produces journal entries, balance sheets, and 23 inconsistency-code labels. On a 710-record evaluation split, six contemporary LLMs reach at most 46% exact final-balance-sheet accuracy. Four models show a 26-41 pp gap between BS_exact, which scores the balance sheet the model reports, and BS_recon, which scores the balance sheet obtained by replaying the model's entries through our ledger. Models often recover numerically plausible entries but fail to bind them to supporting documents and aggregate them consistently. Citation-pressure prompting barely changes document-linking errors, while ledger-feedback ablations substantially improve reported balance sheets and expose inconsistency-detection trade-offs. Expert finance reviewers validate the benchmark design and labels.
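The distinction between BS_exact and BS_recon can be illustrated with a minimal sketch. This is not the benchmark's ledger: the chart of accounts, the entry format, the `replay_entries` helper, and the strict-equality scoring are all hypothetical, chosen only to show how a model can produce correct journal entries yet report a balance sheet that disagrees with them.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical chart of accounts (assumption): account name -> balance-sheet side.
ACCOUNT_SIDE = {
    "cash": "asset",
    "inventory": "asset",
    "accounts_payable": "liability",
    "share_capital": "equity",
}


def replay_entries(entries):
    """Aggregate double-entry journal entries into per-account balances."""
    balances = defaultdict(Decimal)
    for entry in entries:
        debits = sum(Decimal(l["amount"]) for l in entry["lines"] if l["side"] == "debit")
        credits = sum(Decimal(l["amount"]) for l in entry["lines"] if l["side"] == "credit")
        if debits != credits:
            raise ValueError("unbalanced journal entry")
        for line in entry["lines"]:
            amount = Decimal(line["amount"])
            is_asset = ACCOUNT_SIDE[line["account"]] == "asset"
            # Assets increase on debit; liabilities and equity increase on credit.
            increases = (line["side"] == "debit") == is_asset
            balances[line["account"]] += amount if increases else -amount
    return dict(balances)


# Toy model output: the entries are correct, but the reported sheet omits inventory.
entries = [
    {"lines": [
        {"account": "cash", "side": "debit", "amount": "1000"},
        {"account": "share_capital", "side": "credit", "amount": "1000"},
    ]},
    {"lines": [
        {"account": "inventory", "side": "debit", "amount": "300"},
        {"account": "accounts_payable", "side": "credit", "amount": "300"},
    ]},
]
reported = {
    "cash": Decimal("1000"),
    "inventory": Decimal("0"),
    "accounts_payable": Decimal("300"),
    "share_capital": Decimal("1000"),
}
gold = {
    "cash": Decimal("1000"),
    "inventory": Decimal("300"),
    "accounts_payable": Decimal("300"),
    "share_capital": Decimal("1000"),
}

recon = replay_entries(entries)
bs_exact = reported == gold  # scores the sheet the model reported
bs_recon = recon == gold     # scores the sheet rebuilt from the model's entries
print(bs_exact, bs_recon)
```

In this toy case the replayed entries match the gold balance sheet while the reported sheet does not, which is the kind of record that contributes to a BS_exact versus BS_recon gap. The benchmark's actual matching rules (tolerances, account normalization, citation checks) may differ from the strict equality assumed here.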