Retrieval-Augmented Generation (RAG) enables Large Language Models (LLMs) to generalize to new information by decoupling reasoning capabilities from static knowledge bases. Prior RAG enhancements have explored vertical scaling (assigning subtasks to specialized modules) and horizontal scaling (replicating tasks across multiple agents) to improve performance. However, real-world applications impose diverse Service Level Agreements (SLAs) and Quality of Service (QoS) requirements, which involve trade-offs among objectives such as reducing cost, ensuring answer quality, and adhering to specific operational constraints. In this work, we present a systems-oriented approach to multi-agent RAG tailored to real-world Question Answering (QA) applications. By integrating task-specific non-functional requirements, such as answer quality, cost, and latency, into the system, we enable dynamic reconfiguration to meet diverse SLAs. Our method maps these Service Level Objectives (SLOs) to system-level parameters, so that the system can produce the best achievable results within specified resource constraints. We conduct a case study in the QA domain, demonstrating how dynamic re-orchestration of a multi-agent RAG system can manage the trade-off between answer quality and cost. By adjusting the system configuration based on query intent and operational conditions, we systematically balance performance against resource utilization. This allows the system to meet SLOs for various query types, which shows its practicality for real-world applications.
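
To make the idea of mapping SLOs to system-level parameters more concrete, the following is a minimal sketch and not the method described in this work. It assumes that each candidate configuration of a multi-agent RAG pipeline comes with offline estimates of quality, cost, and latency, and it picks the highest-quality configuration that satisfies the SLO. The names `Config`, `SLO`, `select_config`, and `CANDIDATES`, the choice of `num_agents` and `top_k` as parameters, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """Hypothetical system-level parameters with assumed offline estimates."""
    num_agents: int       # number of agents answering in parallel (assumed knob)
    top_k: int            # passages retrieved per query (assumed knob)
    est_quality: float    # assumed answer-quality estimate in [0, 1]
    est_cost: float       # assumed cost per query, arbitrary units
    est_latency: float    # assumed latency per query, seconds


@dataclass(frozen=True)
class SLO:
    """Hypothetical service level objective for one query class."""
    max_cost: float
    max_latency: float
    min_quality: float = 0.0


def select_config(slo: SLO, candidates: Sequence[Config]) -> Optional[Config]:
    """Return the feasible config with the highest estimated quality.

    Ties on quality are broken in favor of lower cost. Returns None when
    no candidate satisfies the SLO.
    """
    feasible = [
        c for c in candidates
        if c.est_cost <= slo.max_cost
        and c.est_latency <= slo.max_latency
        and c.est_quality >= slo.min_quality
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: (c.est_quality, -c.est_cost))


# Illustrative values only; these are not measurements from the case study.
CANDIDATES = [
    Config(num_agents=1, top_k=3, est_quality=0.62, est_cost=1.0, est_latency=0.8),
    Config(num_agents=3, top_k=5, est_quality=0.74, est_cost=3.2, est_latency=1.5),
    Config(num_agents=5, top_k=10, est_quality=0.81, est_cost=6.0, est_latency=2.9),
]

if __name__ == "__main__":
    budget_slo = SLO(max_cost=4.0, max_latency=2.0)
    print(select_config(budget_slo, CANDIDATES))
```

Under this sketch, a tighter cost or latency budget selects a smaller configuration, and a looser one allows more agents and more retrieved passages. A real system would presumably also condition the choice on query intent and current operational conditions, which this example leaves out.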