The attention mechanism in text generation is memory-bound due to its sequential nature, so off-chip memory accesses should be minimized for faster execution. Although prior methods addressed this by pruning unimportant tokens, they fall short of selectively removing tokens with near-zero attention probabilities in each instance. Our method estimates these probabilities before the softmax function, effectively removing low-probability tokens and achieving a 12.1x pruning ratio without fine-tuning. Additionally, we present a hardware design supporting seamless on-demand off-chip access. Our approach reduces memory accesses by 2.6x, leading to an average 2.3x speedup and 2.4x higher energy efficiency.
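The pre-softmax pruning idea can be illustrated with a minimal sketch. The observation is that a token's softmax probability is bounded by exp(s_i - max(s)), so tokens whose score falls far enough below the running maximum are guaranteed to have near-zero probability and can be skipped before the softmax (and before fetching their values from off-chip memory). The threshold parameter `log_eps` below is a hypothetical illustration, not the paper's exact criterion.

```python
import numpy as np

def pruned_attention(q, K, V, log_eps=8.0):
    """Single-query attention with pre-softmax token pruning.

    A token i is kept only if its score s_i is within `log_eps` of
    max(s): otherwise exp(s_i - max(s)) <= exp(-log_eps), which upper
    bounds its softmax probability, so dropping it barely changes the
    output. This is a simplified sketch of the technique, not the
    paper's implementation.
    """
    d = q.shape[-1]
    s = K @ q / np.sqrt(d)            # pre-softmax attention scores
    keep = s >= s.max() - log_eps     # prune near-zero-probability tokens
    s_kept = s[keep]
    p = np.exp(s_kept - s_kept.max())  # numerically stable softmax
    p /= p.sum()
    return p @ V[keep], keep
```

With a loose threshold the result matches full attention exactly; tightening `log_eps` trades a small approximation error for fewer value rows read, which is where the memory-access savings come from.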