In the rapidly evolving field of deep learning, inference performance has become pivotal as models grow more complex and are deployed in diverse applications. Among these models, autoregressive models stand out for their state-of-the-art performance in numerous generative tasks. By design, they exploit a temporal dependency structure in which the current token's probability distribution is conditioned on the preceding tokens. This inherently sequential characteristic adheres to the Markov chain assumption and lacks temporal parallelism, which poses unique challenges. The absence of parallelism is especially pronounced in industrial contexts, where inference requests follow a Poisson time distribution and require responses of varying lengths. Existing solutions, such as dynamic batching and concurrent model instances, come with severe overheads and a lack of flexibility; these coarse-grained methods fall short of achieving optimal latency and throughput. To address these shortcomings, we propose Flover, a temporal fusion framework for efficient inference in autoregressive models that eliminates the need for heuristic settings and applies to a wide range of inference scenarios. By providing more fine-grained parallelism over the temporality of requests and employing an efficient memory shuffle algorithm, Flover achieves up to 11x faster inference on GPT models compared to the cutting-edge solutions provided by NVIDIA Triton FasterTransformer. Crucially, by leveraging the advanced tensor parallel technique, Flover remains effective across diverse computational settings, from single-GPU setups to multi-node scenarios, offering robust performance optimization that transcends hardware boundaries.