Financial reporting is a natural proving ground for large language models, and the very long context windows now available in models of all sizes make rigorous evaluation in this domain increasingly pressing. Yet most public financial resources reduce the task to plain-text SEC 10-K filings paired with a handful of question-answer items. We release LEDGER (Long-context Evaluation of Documents for Grounded Extraction and Retrieval), a corpus of 4,999 digitized corporate annual reports: full documents with figures, tables, and narrative, not just regulatory filings. Each report is labeled with 31 consolidated financial KPIs that serve as extraction targets, and is linked to the market's reaction on the earnings date. From this data we derive three evaluation benchmarks spanning the difficulty spectrum: a pure page-level KPI retrieval task with TREC-style relevance judgments over 118,048 natural-language questions, a conversational "needle-in-a-haystack" single-value lookup, and a full KPI extraction task, the latter two posed over long, numerically dense reports. We additionally provide human annotations of OCR quality, with inter-annotator agreement, and the complete extraction, validation, and scoring toolchain. Finally, we demonstrate the dataset's research utility with a case study linking CEO-letter rhetoric to post-publication market impact.
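To make the page-level retrieval setting concrete, the following is a minimal sketch of how a ranked list of pages could be scored against TREC-style relevance judgments. It assumes the standard four-column qrels layout (query id, iteration, document id, relevance). The page identifiers such as `reportA_p12`, the helper names, and the choice of recall@k and reciprocal rank are illustrative assumptions, not the identifiers or metrics used by the LEDGER toolchain.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse TREC-style qrels lines: '<query_id> <iteration> <doc_id> <relevance>'."""
    qrels = defaultdict(dict)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        qid, _iteration, doc_id, rel = parts
        qrels[qid][doc_id] = int(rel)
    return dict(qrels)


def recall_at_k(ranked, judged, k):
    """Fraction of the relevant pages that appear in the top-k of the ranking."""
    relevant = {doc for doc, rel in judged.items() if rel > 0}
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


def reciprocal_rank(ranked, judged):
    """1 / rank of the first relevant page, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if judged.get(doc, 0) > 0:
            return 1.0 / rank
    return 0.0


# Hypothetical judgments: each "document" is one page of one annual report.
QRELS_LINES = [
    "q1 0 reportA_p12 1",
    "q1 0 reportA_p13 1",
    "q1 0 reportA_p40 0",
    "q2 0 reportB_p7 1",
]

# Hypothetical system output: pages ranked best-first for each question.
RUN = {
    "q1": ["reportA_p40", "reportA_p12", "reportA_p02", "reportA_p13"],
    "q2": ["reportB_p7", "reportB_p8"],
}

qrels = parse_qrels(QRELS_LINES)
recall_scores = {q: recall_at_k(RUN.get(q, []), qrels[q], k=3) for q in qrels}
rr_scores = {q: reciprocal_rank(RUN.get(q, []), qrels[q]) for q in qrels}
mean_rr = sum(rr_scores.values()) / len(rr_scores)
```

Unjudged pages (here `reportA_p02` and `reportB_p8`) are treated as non-relevant, which is the usual convention for TREC-style evaluation; a real evaluation would more likely rely on an established scorer than on hand-written metrics.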