RRAM crossbars have been studied as building blocks for in-memory neural network accelerators because of their in-situ computing capability. However, prior RRAM-based accelerators lose efficiency when executing popular attention models. We observe that the frequent softmax operations are the efficiency bottleneck and are also insensitive to computing precision. We therefore propose STAR, which boosts computing efficiency for attention models with an efficient RRAM-based softmax engine and a fine-grained global pipeline. Specifically, STAR exploits the versatility and flexibility of RRAM crossbars to trade off model accuracy against hardware efficiency. Experimental results on several datasets show that STAR achieves up to 30.63x and 1.31x computing efficiency improvements over the GPU and the state-of-the-art RRAM-based attention accelerators, respectively.
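The observation that softmax tolerates reduced computing precision can be illustrated with a small software experiment. The sketch below is not STAR's RRAM softmax engine; it only compares an exact softmax against one whose shifted inputs and exponentials are uniformly quantized. The function names, the 8-bit width, and the clipping range of -8.0 are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Reference softmax in full floating-point precision."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def quantize(x, bits, lo, hi):
    """Clip x to [lo, hi] and snap it to one of 2**bits uniform levels."""
    levels = 2 ** bits - 1
    x = min(max(x, lo), hi)
    step = (hi - lo) / levels
    return lo + round((x - lo) / step) * step


def low_precision_softmax(xs, bits=8, clip=-8.0):
    """Softmax with quantized inputs and exponentials (illustrative only).

    `bits` and `clip` are hypothetical parameters, not values from the paper.
    """
    m = max(xs)
    # Shifted scores lie in (-inf, 0]; clip to [clip, 0] and quantize.
    shifted = [quantize(x - m, bits, clip, 0.0) for x in xs]
    # exp of a value in [clip, 0] lies in (0, 1]; quantize it as well.
    exps = [quantize(math.exp(s), bits, 0.0, 1.0) for s in shifted]
    total = sum(exps)  # at least 1.0, since the max element maps to exp(0)
    return [e / total for e in exps]


if __name__ == "__main__":
    scores = [2.0, 1.0, 0.1, -1.5]
    exact = softmax(scores)
    approx = low_precision_softmax(scores)
    max_err = max(abs(a - b) for a, b in zip(exact, approx))
    print("exact :", exact)
    print("approx:", approx)
    print("max abs error:", max_err)
```

Lowering `bits` in this sketch shows how the output distribution degrades as precision drops, which is the kind of accuracy versus efficiency trade-off the abstract refers to.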