Retrieval-Augmented Generation (RAG) grounds Large Language Models (LLMs) in external knowledge to mitigate factual hallucinations. Recent paradigms have shifted from static pipelines to Modular and Agentic RAG frameworks, granting models autonomy for multi-hop reasoning and self-correction. However, current reflective RAG systems rely heavily on massive LLMs as universal evaluators. In high-throughput systems, executing complete forward passes through billion-parameter models merely for binary routing introduces severe computational redundancy. Furthermore, in autonomous agent scenarios, inaccurate retrieval causes models to expend excessive tokens on spurious reasoning and redundant tool calls, inflating Time-to-First-Token (TTFT) and costs. We propose Tiny-Critic RAG, which decouples evaluation from generation by deploying a parameter-efficient Small Language Model (SLM) fine-tuned via Low-Rank Adaptation (LoRA). Acting as a deterministic gatekeeper, Tiny-Critic employs constrained decoding and a non-thinking inference mode to perform ultra-low-latency binary routing. Evaluations on noise-injected datasets demonstrate that Tiny-Critic RAG achieves routing accuracy comparable to GPT-4o-mini while reducing latency by an order of magnitude, establishing a highly cost-effective paradigm for agent deployment.
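The routing mechanism can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the SLM critic produces one vector of logits per query–context pair, and that constrained decoding masks the vocabulary down to two hypothetical label tokens so the router emits exactly one token per decision.

```python
# Hedged sketch of constrained decoding for binary routing (assumed token ids,
# not from the paper). All vocabulary entries except the two label tokens are
# masked out, so the critic decodes a single step with no free-form generation.

YES_ID, NO_ID = 0, 1       # hypothetical ids for "relevant" / "irrelevant"
ALLOWED = {YES_ID, NO_ID}  # the constrained-decoding label set

def route(logits):
    """Return a binary routing decision from one forward pass of the critic.

    `logits` stands in for the SLM's output distribution over the vocabulary;
    only the allowed label tokens compete, so a high score on any other token
    (e.g. the start of a chain-of-thought) cannot leak into the decision.
    """
    masked = {tid: score for tid, score in enumerate(logits) if tid in ALLOWED}
    best = max(masked, key=masked.get)
    return "relevant" if best == YES_ID else "irrelevant"

# Even if a non-label token (id 2) scores highest, masking excludes it:
print(route([0.2, 1.1, 5.0]))  # -> irrelevant
print(route([3.0, 1.0, 9.9]))  # -> relevant
```

Because the decision is a single constrained decoding step, latency is dominated by one forward pass of the SLM rather than a multi-token generation loop, which is what enables the order-of-magnitude TTFT reduction the abstract claims.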