Powerful foundation models with Transformer architectures, including large language models (LLMs), have ushered in a new era of Generative AI across various industries. Industry and the research community have witnessed a large number of new applications based on these foundation models, including question answering, customer service, image and video generation, and code completion, among others. However, as the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios. As a result, the demand for cost-effective and fast inference using AI accelerators is ever higher. To this end, our tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators. Beginning with an overview of basic Transformer architectures and deep learning system frameworks, we dive deep into system optimization techniques for fast and memory-efficient attention computation and discuss how they can be implemented efficiently on AI accelerators. Next, we describe architectural elements that are key for fast Transformer inference. Finally, we examine various model compression and fast decoding strategies in the same context.