As Large Language Models (LLMs) are increasingly deployed in the finance domain, they are expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details, failing to reflect the complexity of professional analysis, which requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings that mirrors financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60\% and 14.35\% as tasks shift from single-document reasoning to longitudinal and cross-entity analysis, respectively. This degradation is driven by increased comparison hallucinations and by temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency, limitations that existing benchmarks have yet to formally categorize or quantify.