Recent advances in agentic large language models (LLMs) have substantially improved Text-to-SQL, enabling users without database expertise to query databases intuitively. However, deploying agentic LLM-based Text-to-SQL systems in production remains challenging due to multi-stage dependencies, strict latency requirements, and deployment complexity across heterogeneous GPUs in enterprise clusters. Existing LLM serving frameworks are designed mainly for independent inference tasks, leading to suboptimal performance and frequent service-level objective (SLO) violations for Text-to-SQL workloads. In this paper, we introduce \sys, a framework for scheduling and executing agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters serving multi-tenant requests. \sys adopts a hierarchical scheduler that combines global workload-balanced task dispatching with an adaptive local priority queue, guided by a systematic analysis of agentic Text-to-SQL workflows. We also propose a lightweight simulation-based method to tune key scheduling hyperparameters, improving robustness and adaptability. Evaluations on realistic Text-to-SQL benchmarks show that \sys significantly outperforms state-of-the-art LLM serving frameworks. Across all traces, \sys reduces P95 tail latency by $1.42{\sim}1.56\times$ and increases throughput by $1.49{\sim}1.81\times$, demonstrating consistent gains under diverse workloads.