HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL

Recent advances in applying the agentic paradigm of large language models (LLMs) have substantially improved Text-to-SQL capabilities, letting users without specialized database knowledge query databases intuitively. However, deploying agentic LLM-based Text-to-SQL systems in production presents significant challenges that stem from their inherently multi-stage computational dependencies, their strict latency requirements, and the complexity of deploying across the heterogeneous GPUs common in enterprise clusters. Meanwhile, existing LLM serving frameworks are designed primarily for independent inference tasks, which leads to suboptimal performance and frequent service-level objective (SLO) violations on Text-to-SQL workloads. In this paper, we introduce HEXGEN-FLOW, a novel framework designed explicitly to schedule and execute agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters that serve multi-tenant requests. HEXGEN-FLOW introduces a hierarchical scheduling approach that combines global workload-balanced task dispatching with an adaptive local priority queue, guided by a systematic analysis of agentic Text-to-SQL workflows. Additionally, we propose a lightweight simulation-based method for tuning critical scheduling hyperparameters, which further improves robustness and adaptability. Our evaluation on realistic Text-to-SQL benchmarks shows that HEXGEN-FLOW significantly outperforms state-of-the-art LLM serving frameworks. Across all traces, HEXGEN-FLOW reduces P95 tail latency by $1.42{\sim}1.56\times$ and increases throughput by $1.49{\sim}1.81\times$, demonstrating robust improvements under diverse workloads. Our code is available at https://github.com/Relaxed-System-Lab/Hexgen-Flow.

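
To make the two-level idea in the abstract concrete, the following is a minimal sketch of hierarchical scheduling: a global dispatcher that balances work across heterogeneous instances, and a per-instance priority queue that serves the most urgent task first. It is not the HEXGEN-FLOW implementation. The names `Instance`, `dispatch`, and `pop_next`, the `speed` and `cost` units, and the slack-based priority are all assumptions chosen for illustration; in particular, the priority here is fixed at enqueue time, whereas the abstract describes an adaptive local queue.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A model-serving instance (hypothetical abstraction, not the paper's API)."""
    name: str
    speed: float                 # assumed relative throughput (work units per second)
    pending_work: float = 0.0    # total work units currently queued on this instance
    queue: list = field(default_factory=list)  # heap of (slack, seq, task_id, cost)

    def expected_finish(self, cost: float) -> float:
        # Estimated time to drain the current backlog plus the new task.
        return (self.pending_work + cost) / self.speed


def dispatch(instances, task_id, cost, deadline, now, seq):
    """Global level: route the task to the instance with the earliest
    estimated completion, which accounts for both load and GPU speed."""
    best = min(instances, key=lambda inst: inst.expected_finish(cost))
    best.pending_work += cost
    # Local priority key: slack = time left before the deadline after
    # subtracting the task's own estimated service time. Smaller is more urgent.
    slack = (deadline - now) - cost / best.speed
    heapq.heappush(best.queue, (slack, seq, task_id, cost))
    return best


def pop_next(instance):
    """Local level: serve the queued task with the least slack first."""
    _slack, _seq, task_id, cost = heapq.heappop(instance.queue)
    instance.pending_work -= cost
    return task_id
```

As a usage example, with a fast instance (`speed=2.0`) and a slow one (`speed=1.0`), a task of cost 4 goes to the fast instance, a following task of cost 2 goes to the then less loaded slow instance, and within each instance `pop_next` returns tasks in order of increasing slack. The `seq` argument is a caller-supplied counter that breaks ties so that equally urgent tasks are served in arrival order.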