Transformers have revolutionized deep learning and generative modeling, enabling major advances in natural language processing. However, Transformer model sizes continue to grow, driven by the pursuit of enhanced capabilities across a wide range of deep learning tasks. This trend of ever-increasing model size gives rise to new challenges in memory and compute requirements. Conventional computing platforms, including GPUs, suffer from suboptimal performance due to the memory demands imposed by models with millions or billions of parameters. Emerging chiplet-based platforms, enabled by a Network-on-Interposer (NoI), provide a new avenue for compute- and data-intensive machine learning (ML) applications. However, designing suitable hardware accelerators for Transformer inference workloads is challenging due to the wide variety of complex computing kernels in the Transformer architecture. In this paper, we leverage chiplet-based heterogeneous integration (HI) to design a high-performance and energy-efficient multi-chiplet platform that accelerates Transformer workloads. We demonstrate that the proposed NoI architecture caters to the data access patterns inherent in Transformer models. The optimized placement of the chiplets and the associated NoI links and routers enables superior performance compared to state-of-the-art hardware accelerators. The proposed NoI-based architecture scales across Transformer models of varying sizes and improves latency and energy efficiency by up to 11.8x and 2.36x, respectively, compared with the existing state-of-the-art architecture HAIMA.