Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy target or a budget target, select from a predefined pipeline catalog, at inference time, the configuration that minimizes cost subject to the accuracy target or maximizes accuracy subject to the budget. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
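
The inference-time selection rule described above can be illustrated with a minimal sketch. It assumes an additive linear penalty, score = predicted correctness - `cost_weight` * cost, which is one plausible reading of "penalized by cost" and not necessarily the exact form BRANE uses. The names `PipelineConfig`, `select_config`, and `cost_weight`, the toy predictors, and the cost values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class PipelineConfig:
    """One entry in the pipeline catalog (hypothetical structure)."""
    name: str
    cost: float  # assumed per-query serving cost, in arbitrary units


# A predictor maps query features to an estimated probability of a correct answer.
Predictor = Callable[[Sequence[float]], float]


def select_config(
    features: Sequence[float],
    catalog: Sequence[PipelineConfig],
    predictors: Dict[str, Predictor],
    cost_weight: float,
) -> PipelineConfig:
    """Return the config with the highest cost-penalized predicted correctness.

    Assumption: the penalty is linear in cost. Changing cost_weight moves the
    cost-quality tradeoff without retraining any predictor.
    """
    def score(cfg: PipelineConfig) -> float:
        return predictors[cfg.name](features) - cost_weight * cfg.cost

    return max(catalog, key=score)


# Toy catalog: a cheap single-hop pipeline and an expensive multi-hop one.
catalog = [PipelineConfig("small_1hop", 1.0), PipelineConfig("large_3hop", 10.0)]

# Stand-ins for trained per-configuration predictors; features[0] is an
# assumed "query difficulty" characteristic in [0, 1].
predictors: Dict[str, Predictor] = {
    "small_1hop": lambda f: 0.9 if f[0] < 0.5 else 0.3,
    "large_3hop": lambda f: 0.95,
}

easy_query = [0.2]
hard_query = [0.8]

easy_choice = select_config(easy_query, catalog, predictors, cost_weight=0.05)
hard_choice = select_config(hard_query, catalog, predictors, cost_weight=0.05)
print(easy_choice.name, hard_choice.name)
```

In this toy setup, a moderate `cost_weight` routes the easy query to the cheap pipeline and the hard query to the expensive one. Setting `cost_weight` to 0 always favors the configuration with the highest predicted correctness, and a large enough value always favors the cheapest.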