As Large Language Models (LLMs) are increasingly deployed in the finance domain, they are expected to parse complex regulatory disclosures. However, existing benchmarks often focus on isolated details and fail to reflect the complexity of professional analysis, which requires synthesizing information across multiple documents, reporting periods, and corporate entities. Furthermore, these benchmarks do not disentangle whether errors arise from retrieval failures, generation inaccuracies, domain-specific reasoning mistakes, or misinterpretation of the query or context, making it difficult to precisely diagnose performance bottlenecks. To bridge these gaps, we introduce Fin-RATE, a benchmark built on U.S. Securities and Exchange Commission (SEC) filings that mirrors financial analyst workflows through three pathways: detail-oriented reasoning within individual disclosures, cross-entity comparison under shared topics, and longitudinal tracking of the same firm across reporting periods. We benchmark 17 leading LLMs, spanning open-source, closed-source, and finance-specialized models, under both ground-truth context and retrieval-augmented settings. Results show substantial performance degradation, with accuracy dropping by 18.60% and 14.35%, respectively, as tasks shift from single-document reasoning to longitudinal and cross-entity analysis. This degradation is associated with increased comparison hallucinations and temporal and entity mismatches, and is further reflected in declines in reasoning quality and factual consistency. Existing benchmarks have yet to formally categorize or quantify these limitations.