Recent advances in agentic large language models (LLMs) have substantially improved Text-to-SQL, enabling users without database expertise to query databases intuitively. However, deploying agentic LLM-based Text-to-SQL systems in production remains challenging due to multi-stage dependencies, strict latency requirements, and deployment complexity across heterogeneous GPUs in enterprise clusters. Existing LLM serving frameworks are designed mainly for independent inference tasks, leading to suboptimal performance and frequent service-level objective (SLO) violations for Text-to-SQL workloads. In this paper, we introduce \sys, a framework for scheduling and executing agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters serving multi-tenant requests. \sys adopts a hierarchical scheduler that combines global workload-balanced task dispatching with an adaptive local priority queue, guided by a systematic analysis of agentic Text-to-SQL workflows. We also propose a lightweight simulation-based method to tune key scheduling hyperparameters, improving robustness and adaptability. Evaluations on realistic Text-to-SQL benchmarks show that \sys significantly outperforms state-of-the-art LLM serving frameworks. Across all traces, \sys reduces P95 tail latency by $1.42{\sim}1.56\times$ and increases throughput by $1.49{\sim}1.81\times$, demonstrating consistent gains under diverse workloads.