Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their excellent performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach first performs hierarchical retrieval to reduce confusion among similar texts: it retrieves related documents and then selects the most relevant passages from them. The evidence curation process then removes irrelevant passages and, when necessary, automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
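To make the described pipeline concrete, the Python sketch below outlines one possible reading of the hierarchical retrieval and evidence curation loop. The component names (retrieve_documents, retrieve_passages, filter_evidence, is_sufficient, complementary_query, generate_answer), their signatures, and the loop and hyperparameter choices are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: the paper does not prescribe these component
# signatures, so everything below is an illustrative assumption.

@dataclass
class Passage:
    doc_id: str
    text: str

def hirec_answer(
    question: str,
    retrieve_documents: Callable[[str, int], List[str]],                # query -> top-k document IDs
    retrieve_passages: Callable[[str, List[str], int], List[Passage]],  # query + doc IDs -> passages
    filter_evidence: Callable[[str, List[Passage]], List[Passage]],     # drop irrelevant passages
    is_sufficient: Callable[[str, List[Passage]], bool],                # can the evidence answer the question?
    complementary_query: Callable[[str, List[Passage]], str],           # ask for what is still missing
    generate_answer: Callable[[str, List[Passage]], str],
    max_rounds: int = 3,
    top_docs: int = 5,
    top_passages: int = 10,
) -> str:
    """Two-stage (document -> passage) retrieval with an evidence-curation loop."""
    evidence: List[Passage] = []
    query = question
    for _ in range(max_rounds):
        # Stage 1: narrow the corpus to a few candidate filings first, so that
        # near-duplicate boilerplate across similar documents is less confusing.
        doc_ids = retrieve_documents(query, top_docs)
        # Stage 2: select fine-grained passages only from those documents.
        passages = retrieve_passages(query, doc_ids, top_passages)
        # Evidence curation: keep only passages relevant to the original question.
        evidence = filter_evidence(question, evidence + passages)
        if is_sufficient(question, evidence):
            break
        # If information is still missing, issue a complementary query and retry.
        query = complementary_query(question, evidence)
    return generate_answer(question, evidence)
```

In this reading, the document-level stage is what distinguishes hierarchical retrieval from flat passage retrieval, and the curation loop both prunes noisy evidence and re-queries for gaps before the answer is generated.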