Retrieval-Augmented Generation (RAG) expands the knowledge boundary of large language models (LLMs) at inference time by retrieving external documents as context. However, retrieval becomes increasingly time-consuming as knowledge databases grow. Existing acceleration strategies either compromise accuracy through approximate retrieval or achieve only marginal gains by reusing the results of strictly identical queries. We propose HaS, a homology-aware speculative retrieval framework that performs low-latency speculative retrieval over restricted scopes to obtain candidate documents (the draft), then validates whether they contain the required knowledge. The validation, grounded in the homology relation between queries, is formulated as a homologous query re-identification task: once a previously observed query is identified as a homologous re-encounter of the incoming query, the draft is accepted, allowing the system to bypass slow full-database retrieval. Because homologous queries are prevalent under real-world popularity patterns, HaS achieves substantial efficiency gains. Extensive experiments demonstrate that HaS reduces retrieval latency by 23.74% and 36.99% across datasets, with only a marginal 1-2% drop in accuracy. As a plug-and-play solution, HaS also significantly accelerates complex multi-hop queries in modern agentic RAG pipelines. Source code is available at: https://github.com/ErrEqualsNil/HaS.
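The accept-or-fall-back control flow described above can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the class name `SpeculativeRetriever`, the `threshold` parameter, the use of an in-memory history of past queries as the restricted scope, and the cosine-similarity test standing in for homologous query re-identification are all assumptions introduced here.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SpeculativeRetriever:
    # Hypothetical slow path: maps a query embedding to a list of document ids.
    full_retrieve: Callable[[List[float]], List[str]]
    # Assumed acceptance threshold; a stand-in for the paper's validation step.
    threshold: float = 0.9
    # Restricted scope (assumption): previously seen queries and their results.
    history: List[Tuple[List[float], List[str]]] = field(default_factory=list)
    full_calls: int = 0

    def retrieve(self, query_vec: List[float]) -> Tuple[List[str], bool]:
        """Return (documents, accepted_draft)."""
        # Speculative step: look only at the small history, not the full database.
        best_sim, draft = -1.0, None
        for past_vec, past_docs in self.history:
            sim = cosine(query_vec, past_vec)
            if sim > best_sim:
                best_sim, draft = sim, past_docs

        # Validation step: accept the draft only if a past query looks like
        # a re-encounter of the incoming one (here: similarity >= threshold).
        if draft is not None and best_sim >= self.threshold:
            return draft, True

        # Fallback: slow full-database retrieval, then remember the result.
        docs = self.full_retrieve(query_vec)
        self.full_calls += 1
        self.history.append((query_vec, docs))
        return docs, False


# Toy two-document "database" used only to exercise the sketch.
TOY_DB = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}


def toy_full_retrieve(vec: List[float]) -> List[str]:
    """Exhaustive top-1 search over TOY_DB."""
    return [max(TOY_DB, key=lambda d: cosine(vec, TOY_DB[d]))]
```

In this sketch the first query always takes the slow path, a later near-duplicate query is served from the history, and an unrelated query falls back to full retrieval again. How strict the acceptance test is governs the latency versus accuracy trade-off that the abstract reports.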