Decomposing Retrieval Failures in RAG for Long-Document Financial Question Answering

Retrieval-augmented generation (RAG) is increasingly used for financial question answering over long regulatory filings, yet its reliability in high-stakes settings depends on retrieving the exact context needed to justify an answer. We study a frequent failure mode in which the correct document is retrieved but the page or chunk containing the answer is missed, leading the generator to extrapolate from incomplete context. Despite its practical significance, this within-document retrieval failure has received limited systematic attention in the financial question answering (QA) literature. We evaluate retrieval at three levels of granularity (document, page, and chunk) and introduce an oracle-based analysis that provides empirical upper bounds on retrieval and generation performance. On a 150-question subset of FinanceBench, we reproduce and compare diverse retrieval strategies, including dense, sparse, hybrid, and hierarchical methods, combined with reranking and query reformulation. Across methods, gains in document discovery tend to translate into stronger page recall, yet oracle performance still indicates headroom for page-level and chunk-level retrieval. To target this gap, we introduce a domain fine-tuned page scorer that treats pages as an intermediate retrieval unit between documents and chunks. Unlike prior passage-based hierarchical retrieval, we fine-tune a bi-encoder specifically for page-level relevance on financial filings, exploiting the semantic coherence of pages. Overall, our results show a significant improvement in page recall and chunk retrieval.
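To make the three evaluation granularities concrete, the following is a minimal sketch of how document-, page-, and chunk-level hits could be scored for a ranked retrieval list. It is an illustration under stated assumptions, not the evaluation code used in this work: the `Chunk` record, the function names, the single-gold-evidence setup, and the filing identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Chunk:
    """Hypothetical retrieval unit: a chunk located by document and page."""
    doc_id: str
    page: int
    chunk_id: int


def hits_at_k(retrieved: List[Chunk], gold: Chunk, k: int) -> Dict[str, bool]:
    """Check whether the top-k results hit the gold evidence at each level."""
    top = retrieved[:k]
    return {
        "document": any(c.doc_id == gold.doc_id for c in top),
        "page": any((c.doc_id, c.page) == (gold.doc_id, gold.page) for c in top),
        "chunk": any(c == gold for c in top),
    }


def is_within_document_miss(hits: Dict[str, bool]) -> bool:
    """Right document retrieved, but the page holding the answer was missed."""
    return hits["document"] and not hits["page"]


def recall_at_k(runs: List[Tuple[List[Chunk], Chunk]], k: int) -> Dict[str, float]:
    """Fraction of questions with a hit at each granularity (assumes one gold chunk each)."""
    totals = {"document": 0, "page": 0, "chunk": 0}
    for retrieved, gold in runs:
        for level, hit in hits_at_k(retrieved, gold, k).items():
            totals[level] += int(hit)
    return {level: count / len(runs) for level, count in totals.items()}


# Illustrative data only: identifiers and page numbers are made up.
gold = Chunk("10K_2022", 47, 2)
retrieved = [Chunk("10K_2022", 12, 0), Chunk("10K_2022", 45, 1), Chunk("10Q_2023", 3, 0)]
hits = hits_at_k(retrieved, gold, k=3)
print(hits, is_within_document_miss(hits))
```

In this toy example the correct filing appears in the top three results but the answer page does not, which is the within-document failure described above. Scoring each level separately is what allows the gap between document discovery and page or chunk recall to be measured.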
