With the increasing deployment of Large Language Models (LLMs) in the finance domain, LLMs are increasingly expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis, which requires synthesizing information across multiple documents, reporting periods, and corporate entities. They also do not distinguish whether errors stem from retrieval failures, generation flaws, finance-specific reasoning mistakes, or misunderstanding of the query or context, making it difficult to pinpoint performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings that mirrors financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis, respectively. This degradation is driven by rising comparison hallucinations and time and entity mismatches, and is mirrored by declines in reasoning and factuality, limitations that prior benchmarks have yet to formally categorize or quantify.