Large language model (LLM)-based applications consist of both LLM and non-LLM components, each of which contributes to the end-to-end latency. Despite considerable effort to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimization to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which uses task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables parallelization and pipelining across primitives of different modules, and improves scheduling for better application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola achieves up to 2.09x speedup over existing systems across a range of popular LLM applications. The code is available at https://github.com/NetX-lab/Ayo.
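To make the idea concrete, below is a minimal Python sketch of primitive-level dataflow execution for a RAG-style query. It is illustrative only: `Primitive`, `run_graph`, and the example workflow (query/history embedding, retrieval, LLM prefill and decode) are hypothetical names and stand-ins, not Teola's actual API. It shows the property the abstract claims: once a workflow is expressed at primitive granularity, independent primitives can run in parallel and downstream primitives start as soon as their inputs are ready, whereas module-level orchestration would serialize embedding, retrieval, and generation.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable

@dataclass
class Primitive:
    """One fine-grained unit of work (e.g., embed one input, a retrieval
    call, LLM prefill or decode). Hypothetical, for illustration only."""
    name: str
    work: Callable[[], Awaitable[None]]
    deps: list["Primitive"] = field(default_factory=list)

async def run_graph(primitives: list[Primitive]) -> None:
    """Dataflow execution: each primitive starts as soon as all of its
    dependencies finish, so independent primitives overlap."""
    done = {p.name: asyncio.Event() for p in primitives}

    async def run(p: Primitive) -> None:
        await asyncio.gather(*(done[d.name].wait() for d in p.deps))
        await p.work()
        done[p.name].set()

    await asyncio.gather(*(run(p) for p in primitives))

def make(name: str, secs: float, deps: tuple[Primitive, ...] = ()) -> Primitive:
    async def work() -> None:
        await asyncio.sleep(secs)  # stand-in for the real computation
        print(f"finished {name}")
    return Primitive(name, work, list(deps))

# A RAG-style query at primitive granularity: the two embeddings run in
# parallel, and LLM prefill on the retrieval-independent prompt prefix
# overlaps with retrieval; only decoding waits for both.
e1 = make("embed_query", 0.1)
e2 = make("embed_history", 0.1)
ret = make("retrieve", 0.2, (e1, e2))
pre = make("llm_prefill_prefix", 0.2)
dec = make("llm_decode", 0.3, (ret, pre))

asyncio.run(run_graph([e1, e2, ret, pre, dec]))
```

With these stand-in durations, running the five primitives back-to-back would take 0.9s, while the dataflow schedule finishes in about 0.6s, mirroring the kind of cross-module overlap a primitive-level graph exposes.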