Retrieval-Augmented Generation (RAG) has emerged as the predominant paradigm for grounding Large Language Model outputs in factual knowledge, effectively mitigating hallucinations. However, conventional RAG systems operate under a "retrieve-always" assumption, querying vector databases for every input regardless of query complexity. This static approach incurs substantial computational overhead and inference latency, which is particularly problematic for high-throughput production deployments. We introduce L-RAG (Lazy Retrieval-Augmented Generation), an adaptive framework that implements hierarchical context management through entropy-based gating. L-RAG employs a two-tier architecture: queries are first processed with a compact document summary, and expensive chunk retrieval is triggered only when the model's predictive entropy exceeds a calibrated threshold, signaling genuine uncertainty. Through experiments on SQuAD 2.0 (N=500) with the Phi-2 model, we demonstrate that L-RAG provides a tunable accuracy-efficiency trade-off: at a conservative threshold (τ=0.5), L-RAG achieves 78.2% accuracy, matching Standard RAG (77.8%), while reducing retrievals by 8%; at a balanced threshold (τ=1.0), the retrieval reduction increases to 26% at a modest accuracy cost (76.0%). Latency analysis shows that L-RAG saves 80-210 ms per query when retrieval latency exceeds 500 ms. Analysis of entropy distributions reveals a statistically significant separation (p < 0.001) between correct predictions (H=1.72) and errors (H=2.20), validating entropy as a reliable uncertainty signal. L-RAG offers a practical, training-free path toward more efficient RAG deployment, giving system architects a configurable knob to balance accuracy and throughput requirements.
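As an illustration of the gating mechanism described in the abstract, the sketch below computes the Shannon entropy of the model's next-token distribution and escalates to full chunk retrieval only when that entropy exceeds the threshold τ. This is a minimal sketch under stated assumptions: the function names (`predictive_entropy`, `should_retrieve`) are illustrative, not from the paper, and a real deployment would take the probabilities from the LLM's softmaxed logits after the summary-only pass.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution.

    `probs` is assumed to be a normalized distribution, e.g. the softmax of
    the model's logits when answering from the compact summary alone.
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def should_retrieve(probs, tau=1.0):
    """Entropy gate: trigger expensive chunk retrieval only when the model's
    predictive entropy exceeds the calibrated threshold tau (signaling
    genuine uncertainty); otherwise answer from the summary tier."""
    return predictive_entropy(probs) > tau

# Peaked distribution -> low entropy -> answer from the summary alone.
assert not should_retrieve([0.97, 0.01, 0.01, 0.01], tau=1.0)
# Near-uniform distribution -> high entropy -> escalate to chunk retrieval.
assert should_retrieve([0.25, 0.25, 0.25, 0.25], tau=1.0)
```

Lowering τ (e.g. to the paper's conservative 0.5) makes the gate fire more often, recovering Standard-RAG accuracy at the cost of fewer skipped retrievals; raising it trades accuracy for throughput.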