Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.
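To make the contrast concrete, here is a minimal sketch of the primitive-level dataflow idea: each query is a DAG whose nodes are task primitives (e.g., an embedding call, a vector search, an LLM generation step), and any primitives whose inputs are ready can be dispatched in parallel. The primitive names, the wave-based scheduler, and the `execute` callback below are illustrative assumptions, not Teola's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical primitives of a RAG-style query workflow. Edges encode data
# dependencies; primitives with no unmet dependencies may run concurrently.
GRAPH = {
    "embed_query":   [],                 # embedding-model call
    "rewrite_query": [],                 # small-LLM query rewriting
    "search_docs":   ["embed_query"],    # vector-DB lookup
    "rerank_docs":   ["search_docs", "rewrite_query"],
    "llm_generate":  ["rerank_docs"],    # final LLM prefill + decode
}

def topological_waves(graph):
    """Group primitives into 'waves': members of a wave are mutually
    independent, so a scheduler may dispatch a whole wave at once."""
    done, waves = set(), []
    while len(done) < len(graph):
        ready = [n for n, deps in graph.items()
                 if n not in done and all(d in done for d in deps)]
        if not ready:
            raise ValueError("cycle in dataflow graph")
        waves.append(sorted(ready))
        done.update(ready)
    return waves

def run(graph, execute):
    """Run each wave's primitives in parallel on a thread pool;
    `execute` is a stand-in for invoking the actual primitive."""
    results = {}
    with ThreadPoolExecutor() as pool:
        for wave in topological_waves(graph):
            for name, out in zip(wave, pool.map(execute, wave)):
                results[name] = out
    return results
```

Under module-level orchestration, `embed_query` and `rewrite_query` would typically run serially inside their respective modules; exposing them as graph nodes is what lets the scheduler overlap them, which is the kind of cross-module parallelism the abstract describes.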