Transformers are central to advances in artificial intelligence (AI), excelling in fields ranging from computer vision to natural language processing. Despite their success, their large parameter counts and computational demands make efficient acceleration challenging. To address these limitations, this paper proposes MatrixFlow, a novel co-designed system-accelerator architecture based on a loosely coupled systolic array, together with a new software mapping approach for efficient Transformer code execution. MatrixFlow is co-optimized via a novel dataflow-based matrix multiplication technique that reduces memory overhead. These innovations significantly improve data throughput, which is critical for handling the extensive computations required by Transformers. We validate our approach through full-system simulation in gem5 across several BERT and ViT Transformer models with different data types, demonstrating significant application-wide speed-ups. Our method achieves up to a 22x improvement over a many-core CPU system and outperforms the closest state-of-the-art loosely coupled and tightly coupled accelerators by over 5x and 8x, respectively.
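For intuition only, the sketch below shows a tiled, output-stationary matrix multiplication of the general kind that systolic-array accelerators implement: each output block stays resident while input tiles stream through it. The function name, tile size, and dataflow choice are illustrative assumptions and do not reproduce MatrixFlow's actual mapping or hardware behavior.

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Illustrative tiled matrix multiply: each (tile x tile) output block is
    accumulated in place, mimicking an output-stationary systolic array where
    partial sums stay in the PE grid while A and B tiles stream through.
    This is a software analogy only, not MatrixFlow's dataflow."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            # Partial-sum block held "stationary" (in hardware: PE registers).
            acc = np.zeros((min(tile, M - i), min(tile, N - j)), dtype=A.dtype)
            for k in range(0, K, tile):
                acc += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
            C[i:i+tile, j:j+tile] = acc
    return C

# Example: verify against NumPy's reference matmul.
A = np.random.rand(16, 24).astype(np.float32)
B = np.random.rand(24, 8).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-5)
```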