Large language models (LLMs) have shown strong performance in financial analysis and surface-level factual error detection, yet their ability to identify fraudulent financial misinformation in audited corporate reporting remains underexplored. Existing financial and audit benchmarks focus mainly on factual verification, numerical reasoning, rule compliance, or audit workflows, and rarely evaluate misleading disclosure narratives or management explanations that obscure the true drivers of reported performance. We introduce AuditFraudBench, an enforcement-grounded benchmark constructed from authentic company filings and regulatory materials, including original and restated 10-K and 10-Q filings, structured financial statements, MD&A disclosures, and SEC Accounting and Auditing Enforcement Releases (AAERs). AuditFraudBench contains three tasks: Profit Source Attribution, Misleading Narrative Detection, and Fraud Pattern Classification. Together, they evaluate whether models can identify the true source of reported performance, detect misleading disclosure framing, and classify misconduct mechanisms into known manipulation patterns. We evaluate LLMs from the GPT, DeepSeek, and Qwen series on the benchmark. Results show that both proprietary and open models still struggle to reason jointly over financial figures, disclosure framing, restatement evidence, and enforcement-grounded fraud mechanisms. AuditFraudBench provides a challenging testbed for audit-relevant, evidence-grounded evaluation of LLMs in financial reporting.