Large language model (LLM)-based applications consist of both LLM and non-LLM components, each contributing to the end-to-end latency. Despite great efforts to optimize LLM inference, end-to-end workflow optimization has been overlooked. Existing frameworks employ coarse-grained orchestration with task modules, which confines optimizations to within each module and yields suboptimal scheduling decisions. We propose fine-grained end-to-end orchestration, which utilizes task primitives as the basic units and represents each query's workflow as a primitive-level dataflow graph. This explicitly exposes a much larger design space, enables optimizations in parallelization and pipelining across primitives of different modules, and enhances scheduling to improve application-level performance. We build Teola, a novel orchestration framework for LLM-based applications that implements this scheme. Comprehensive experiments show that Teola can achieve up to 2.09x speedup over existing systems across various popular LLM applications.
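To illustrate what a primitive-level dataflow graph exposes, the following is a minimal sketch and not Teola's implementation. It assumes a hypothetical retrieval-augmented query whose workflow is decomposed into five illustrative primitives. The primitive names and the `parallel_stages` helper are assumptions made for this example; only Python's standard `graphlib` module is used.

```python
from graphlib import TopologicalSorter

# Hypothetical primitive-level graph for one retrieval-augmented query.
# Each key is a primitive; its value is the set of primitives it depends on.
# These names are illustrative assumptions, not taken from Teola.
primitive_deps = {
    "embed_query": set(),
    "prefill_instruction": set(),  # the prompt prefix needs no retrieved context
    "vector_search": {"embed_query"},
    "prefill_context": {"prefill_instruction", "vector_search"},
    "decode_answer": {"prefill_context"},
}


def parallel_stages(deps):
    """Group primitives into stages. Primitives in the same stage do not
    depend on each other, so a scheduler could run them concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        stages.append(ready)
        sorter.done(*ready)
    return stages


stages = parallel_stages(primitive_deps)
for index, stage in enumerate(stages):
    print(index, stage)
```

In this sketch the first stage contains both `embed_query` and `prefill_instruction`: a primitive from the LLM module becomes ready at the same time as one from the embedding module. A module-level view, where the LLM module starts only after retrieval finishes, would hide this overlap.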