Large language models (LLMs) built on the Transformer architecture have become central to natural language processing, multimodal generative artificial intelligence, and agent-oriented artificial intelligence. The self-attention module is the dominant sub-structure in Transformer-based LLMs. Running it on general-purpose graphics processing units (GPUs) places a heavy demand on I/O bandwidth, because intermediate results must be transferred between memory and processing units. To tackle this challenge, this work develops a fully customized vanilla self-attention accelerator, AttentionLego, as the basic building block for constructing spatially expandable LLM processors. AttentionLego provides a basic implementation in fully customized digital logic incorporating Processing-In-Memory (PIM) technology. It is based on PIM-based matrix-vector multiplication and a look-up-table-based Softmax design. The open-source code is available online: https://bonany.cc/attentionleg.
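To make the two named ingredients concrete, the following is a minimal software sketch of single-query attention built from a matrix-vector product and a table-driven Softmax. It is not the AttentionLego hardware design. The function names, the 8-bit table size, the fixed-point step `scale`, and the assumption that attention scores arrive as integers are all illustrative choices that do not come from the paper.

```python
import math

def build_exp_lut(bits=8, scale=1 / 16):
    """Hypothetical table: entry d holds exp(-d * scale) for d = 0 .. 2**bits - 1."""
    return [math.exp(-d * scale) for d in range(1 << bits)]

def lut_softmax(scores, lut):
    """Softmax over integer scores using table lookups instead of exp()."""
    top = max(scores)
    # Subtracting from the maximum makes every index non-negative;
    # differences beyond the table range are clamped to the last entry.
    exps = [lut[min(top - s, len(lut) - 1)] for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Plain matrix-vector product, the operation a PIM array would carry out in memory."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attend(query, keys, values, lut):
    """Single-query attention: scores = K q, weights = softmax(scores), output = weights . V."""
    scores = matvec(keys, query)          # one integer score per key (assumed quantized)
    weights = lut_softmax(scores, lut)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
```

In this sketch the usual 1/sqrt(d) scaling is assumed to be folded into `scale`. A call such as `attend([1, 0], [[2, 0], [2, 0]], [[1.0, 0.0], [0.0, 1.0]], build_exp_lut())` gives both keys the same score, so the output is the average of the two value rows.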