As large language models (LLMs) evolve to handle increasingly long contexts, serving inference requests for contexts of millions of tokens presents unique challenges. While existing long-context techniques are effective for training, they do not address the distinct demands of inference, such as the differing prefill and decode phases and their associated latency constraints, like Time to First Token (TTFT) and Time Between Tokens (TBT). Furthermore, no long-context inference solution today allows batching requests to increase hardware utilization. In this paper, we propose three key innovations for efficient interactive long-context LLM inference, without resorting to any approximation: adaptive chunking to reduce prefill overheads in mixed batching, Sequence Pipeline Parallelism (SPP) to lower TTFT, and KV Cache Parallelism (KVP) to minimize TBT. Combining these contributions into a 3D parallelism strategy enables Mnemosyne to scale interactive inference to context lengths of at least 10 million tokens with the high throughput of batching. To our knowledge, Mnemosyne is the first system to efficiently support 10-million-token long-context inference while satisfying a production-grade SLO on TBT (30 ms) for contexts up to and including 10 million tokens.
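To make the mixed-batching idea concrete, here is a minimal sketch of how an adaptive chunking scheduler might size a prefill chunk. This is an illustrative assumption, not the paper's actual algorithm: the function name `adaptive_chunk_size` and the fixed per-iteration token budget are hypothetical, and the sketch assumes each decode request contributes one token per iteration, so decode tokens are admitted first to keep TBT bounded and prefill fills the leftover budget.

```python
def adaptive_chunk_size(pending_prefill_tokens: int,
                        num_decode_requests: int,
                        token_budget: int) -> int:
    """Hypothetical sketch: choose this iteration's prefill chunk size.

    Decode requests are served first (one token each), so the prefill
    chunk only consumes whatever budget remains. This keeps decode
    latency (TBT) bounded even while a long prefill is in flight.
    """
    remaining = token_budget - num_decode_requests
    return max(0, min(pending_prefill_tokens, remaining))


# Example: with a 512-token budget and 32 decode requests in the batch,
# a long prefill is processed 480 tokens at a time.
chunk = adaptive_chunk_size(pending_prefill_tokens=10_000_000,
                            num_decode_requests=32,
                            token_budget=512)
```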