In the rapidly evolving field of deep learning, inference performance has become pivotal as models grow more complex and are deployed in diverse applications. Among these, autoregressive models stand out for their state-of-the-art performance in numerous generative tasks. By design, these models rely on a temporal dependency structure in which the current token's probability distribution is conditioned on the preceding tokens. This inherently sequential characteristic adheres to the Markov chain assumption and lacks temporal parallelism, which poses unique challenges. The absence of parallelism is especially pronounced in industrial settings, where inference requests arrive following a Poisson time distribution and require responses of varying lengths. Existing solutions, such as dynamic batching and concurrent model instances, incur severe overheads and lack flexibility; these coarse-grained methods fall short of achieving optimal latency and throughput. To address these shortcomings, we propose Flover, a temporal fusion framework for efficient inference in autoregressive models that eliminates the need for heuristic settings and applies to a wide range of inference scenarios. By providing finer-grained parallelism over the temporality of requests and employing an efficient memory shuffle algorithm, Flover achieves up to 11x faster inference on GPT models compared to the cutting-edge solutions provided by NVIDIA Triton FasterTransformer. Crucially, by leveraging advanced tensor-parallel techniques, Flover remains effective across diverse computational settings, from single-GPU setups to multi-node scenarios, offering robust performance optimization across hardware configurations.