Retrieval-augmented generation (RAG) based large language models (LLMs) are widely used in finance for their strong performance on knowledge-intensive tasks. However, standardized documents (e.g., SEC filings) share similar formats, such as repetitive boilerplate text and similar table structures. This similarity causes traditional RAG methods to misidentify near-duplicate text, leading to duplicate retrieval that undermines accuracy and completeness. To address these issues, we propose the Hierarchical Retrieval with Evidence Curation (HiREC) framework. Our approach performs hierarchical retrieval to reduce confusion among similar texts: it first retrieves relevant documents and then selects the most pertinent passages from those documents. The evidence curation process then removes irrelevant passages and, when necessary, automatically generates complementary queries to collect missing information. To evaluate our approach, we construct and release a Large-scale Open-domain Financial (LOFin) question answering benchmark that includes 145,897 SEC documents and 1,595 question-answer pairs. Our code and data are available at https://github.com/deep-over/LOFin-bench-HiREC.
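To make the two-stage pipeline concrete, the following is a minimal sketch of a HiREC-style loop, using toy keyword-overlap scoring. All function names, the scoring scheme, and the loop structure are illustrative assumptions for exposition, not the released implementation at the repository above.

```python
# Hypothetical sketch of hierarchical retrieval + evidence curation.
# Scoring, thresholds, and function names are assumptions, not the HiREC code.

def _score(query, text):
    # Toy relevance score: keyword overlap between query and text.
    return len(set(query.lower().split()) & set(text.lower().split()))

def retrieve_documents(query, corpus, top_k=3):
    # Stage 1a: rank whole documents first, so near-duplicate boilerplate
    # from unrelated filings is not confused with the target filing.
    key = lambda d: _score(query, d["title"] + " " + d["body"])
    return sorted(corpus, key=key, reverse=True)[:top_k]

def select_passages(query, docs, top_k=5):
    # Stage 1b: rank passages only within the retrieved documents.
    passages = [p for d in docs for p in d["body"].split("\n") if p.strip()]
    return sorted(passages, key=lambda p: _score(query, p), reverse=True)[:top_k]

def curate_evidence(question, passages, threshold=1):
    # Stage 2: drop passages with no overlap with the question; report
    # whether evidence is still missing.
    kept = [p for p in passages if _score(question, p) >= threshold]
    return kept, len(kept) == 0

def hirec_style_answer(question, corpus, llm, max_rounds=2):
    evidence, query = [], question
    for _ in range(max_rounds):
        docs = retrieve_documents(query, corpus)
        passages = select_passages(query, docs)
        evidence, missing = curate_evidence(question, evidence + passages)
        if not missing:
            break
        # Ask the LLM for a complementary query targeting the missing info.
        query = llm(f"Rewrite the question to find missing evidence: {question}")
    return llm(f"Question: {question}\nEvidence: {evidence}\nAnswer:")
```

Here `llm` is assumed to be any callable that maps a prompt string to a completion; the real framework's document retriever, passage reranker, and curation criteria are described in the paper and repository.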