Retrieval-Augmented Generation (RAG) systems critically depend on retrieval quality, yet no systematic comparison of modern retrieval methods exists for heterogeneous documents containing both text and tabular data. We benchmark ten retrieval strategies spanning sparse, dense, hybrid fusion, cross-encoder reranking, query expansion, index augmentation, and adaptive retrieval on a challenging financial QA benchmark of 23,088 queries over 7,318 documents with mixed text-and-table content. We evaluate retrieval quality via Recall@k, MRR, and nDCG, and end-to-end generation quality via Number Match, with paired bootstrap significance testing. Our results show that (1) a two-stage pipeline combining hybrid retrieval with neural reranking achieves Recall@5 of 0.816 and MRR@3 of 0.605, outperforming all single-stage methods by a large margin; (2) BM25 outperforms state-of-the-art dense retrieval on financial documents, challenging the common assumption that semantic search universally dominates; and (3) query expansion methods (HyDE, multi-query) and adaptive retrieval provide limited benefit for precise numerical queries, while contextual retrieval yields consistent gains. We provide ablation studies on fusion methods and reranker depth, actionable cost-accuracy recommendations, and release our full benchmark code.
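The retrieval metrics and the significance test named in the abstract can be illustrated with a minimal sketch. It assumes binary relevance judgments (a document is either relevant to a query or not) and a one-sided paired bootstrap over per-query scores. The function names, the binary-gain form of nDCG, and the resampling scheme are illustrative assumptions, not the definitions used in the released benchmark code.

```python
import math
import random
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mrr_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Reciprocal rank of the first relevant document in the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """nDCG with binary gains (assumption): DCG of the top k over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def paired_bootstrap_p(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 2000,
    seed: int = 0,
) -> float:
    """One-sided paired bootstrap (assumed variant).

    Resamples queries with replacement and returns the share of resamples
    in which system A's total score does not exceed system B's.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same queries")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        resampled_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        if resampled_sum <= 0:
            not_better += 1
    return not_better / n_resamples


# Hypothetical single-query example: document IDs are made up for illustration.
ranked_ids = ["d3", "d1", "d7", "d2"]
relevant_ids = {"d1", "d2"}
example_recall = recall_at_k(ranked_ids, relevant_ids, k=3)
example_mrr = mrr_at_k(ranked_ids, relevant_ids, k=3)
example_ndcg = ndcg_at_k(ranked_ids, relevant_ids, k=3)
```

In this sketch the per-query scores passed to `paired_bootstrap_p` would be, for instance, the Recall@5 values of two retrieval strategies on the same query set. Pairing by query keeps query difficulty constant across the two systems, which is why the resampling operates on per-query differences rather than on each system's scores separately.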