Autoregressive models perform well across many generative tasks, but their inherently sequential structure poses challenges. Inference is temporally dependent by design: the probability distribution of the current token is conditioned on the preceding tokens. This dependency severely limits computational efficiency, because a typical request can require thousands of tokens and generating each token requires loading the entire set of model weights, which makes inference memory-bound. The overhead becomes more pronounced in real deployments, where requests arrive at random times and require different generation lengths. Existing solutions, such as dynamic batching and concurrent instances, introduce significant response delays and bandwidth contention, and fall short of optimal latency and throughput. To address these shortcomings, we propose Flover, a temporal fusion framework for efficiently running inference on multiple requests in parallel. We decompose the general generation pipeline into pre-processing and token generation, and equip the framework with a dedicated work scheduler that fuses the generation process temporally across all requests. By orchestrating token-level parallelism, Flover exhibits optimal hardware efficiency and significantly reduces system resource usage. With a fast buffer reordering algorithm that allows the memory of finished tasks to be evicted, it delivers over 11x inference speedup on GPT and 16x on LLAMA compared with the cutting-edge solutions provided by NVIDIA FasterTransformer. Crucially, by leveraging tensor parallelism, Flover is effective across diverse computational settings, from single-GPU setups to distributed scenarios, offering robust performance optimization that adapts to variable use cases.
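The idea of fusing generation temporally across requests can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not Flover's implementation: the `Request` record, `fake_decode_step`, `compact`, and `run_fused` are hypothetical names introduced here, each decode step is assumed to cost the same regardless of how many requests share it (the memory-bound regime described above), and the stable compaction stands in for the buffer reordering algorithm, whose details the abstract does not give.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request record; the field names are illustrative."""
    rid: int
    arrival_step: int  # decode step at which the request becomes available
    gen_len: int       # number of tokens this request needs to generate
    tokens: list = field(default_factory=list)


def fake_decode_step(batch):
    """Stand-in for one fused forward pass: emits one token per active request."""
    for req in batch:
        req.tokens.append(f"t{len(req.tokens)}")


def compact(buffer):
    """Split the buffer into unfinished and finished requests, preserving order.

    A simple stable compaction, assumed for illustration; the actual
    buffer reordering algorithm may differ.
    """
    active = [r for r in buffer if len(r.tokens) < r.gen_len]
    finished = [r for r in buffer if len(r.tokens) >= r.gen_len]
    return active, finished


def run_fused(requests):
    """Simulate one shared decode loop; return {rid: step count at completion}."""
    pending = sorted(requests, key=lambda r: r.arrival_step)
    buffer, done_at, step = [], {}, 0
    while pending or buffer:
        # Admit newly arrived requests at a token boundary.
        while pending and pending[0].arrival_step <= step:
            buffer.append(pending.pop(0))
        if buffer:
            fake_decode_step(buffer)
        step += 1
        # Evict finished requests so their slots are freed immediately.
        buffer, finished = compact(buffer)
        for r in finished:
            done_at[r.rid] = step
    return done_at


if __name__ == "__main__":
    reqs = [Request(0, 0, 4), Request(1, 1, 2), Request(2, 2, 5)]
    print(run_fused(reqs))
```

In this toy run, three requests needing 4, 2, and 5 tokens and arriving at steps 0, 1, and 2 all complete within 7 shared decode steps, whereas serving them one after another would take 11 steps. Late arrivals join at the next token boundary instead of waiting for a batch to drain, and short requests leave the buffer as soon as they finish.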