We present ALTO, a network orchestrator for efficiently serving compound AI systems such as pipelines of language models. ALTO achieves high throughput and low latency by exploiting an optimization opportunity specific to generative language models: streaming intermediate outputs. Because language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible. We highlight two new challenges, correctness and load balancing, that emerge when streaming intermediate data across distributed pipeline stage instances. To address these challenges, we motivate the need for an aggregation-aware routing interface and distributed prompt-aware scheduling. We demonstrate the impact of ALTO's partial output streaming on a complex chatbot verification pipeline: it increases throughput by up to 3x for a fixed latency target of 4 seconds per request, while also reducing tail latency by 1.8x compared to a baseline serving approach.
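To make the idea of streaming intermediate outputs between stages concrete, the following is a minimal single-process sketch using Python generators. It is not ALTO's implementation: there is no networking, routing, or scheduling, and all names (`generate_tokens`, `stream_sentences`, `verify`, `run_pipeline`) are hypothetical. It assumes the downstream stage can operate on one sentence at a time, and it only illustrates that a downstream stage can begin work on a partial output before the upstream stage has finished generating.

```python
from typing import Iterable, Iterator, List, Tuple


def generate_tokens(text: str, log: List[str]) -> Iterator[str]:
    """Stand-in for a language model that emits its output token by token."""
    for token in text.split():
        log.append(f"gen:{token}")
        yield token


def stream_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Forward each sentence downstream as soon as its last token arrives."""
    buffer: List[str] = []
    for token in tokens:
        buffer.append(token)
        if token.endswith((".", "?", "!")):
            yield " ".join(buffer)
            buffer = []
    if buffer:  # flush a trailing fragment with no end punctuation
        yield " ".join(buffer)


def verify(sentences: Iterable[str], log: List[str]) -> Iterator[str]:
    """Stand-in for a downstream stage (assumed to work per sentence)."""
    for sentence in sentences:
        log.append(f"verify:{sentence}")
        yield sentence.upper()  # placeholder for real per-sentence work


def run_pipeline(text: str) -> Tuple[List[str], List[str]]:
    """Chain the stages lazily and return (results, event log)."""
    log: List[str] = []
    results = list(verify(stream_sentences(generate_tokens(text, log)), log))
    return results, log


if __name__ == "__main__":
    results, log = run_pipeline("A b. C d.")
    print(results)
    print(log)
```

In the event log, the entry for verifying the first sentence appears before the entry for generating the first token of the second sentence, which is the overlap that partial output streaming relies on. A distributed version would additionally have to handle the correctness and load-balancing challenges described in the abstract, which this sketch does not attempt.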