Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, typically aimed at long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process only a limited number of tokens at each step, so attention score calculation is less of a bottleneck. Instead, the bottleneck lies in the linear projection layers of multi-head attention and feedforward networks, which constitute a substantial portion of the model size and contribute significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique that targets these linear layers, significantly reducing model size and improving memory and power efficiency. Experiments on on-device Transformer-based streaming speech recognition models show that folding attention reduces model size (and the corresponding memory consumption) by up to 24% and power consumption by up to 23%, without compromising model accuracy or adding computation overhead.
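The bottleneck argument above can be checked with back-of-the-envelope arithmetic. The following is a minimal sketch, not taken from the paper: it counts multiply-accumulate operations (MACs) in one standard Transformer layer, assuming plain self-attention over the current tokens only and ignoring biases, softmax, normalization, and any cached context. The function name `layer_macs` and the dimensions (`d_model=512`, `d_ff=2048`) are hypothetical values chosen for illustration.

```python
def layer_macs(num_tokens: int, d_model: int, d_ff: int) -> dict:
    """Rough MAC counts for one Transformer layer (illustrative assumptions only)."""
    # Linear layers: Q, K, V, and output projections (4 * d_model^2 weights)
    # plus the two feedforward projections (2 * d_model * d_ff weights),
    # each applied once per token.
    linear = num_tokens * (4 * d_model * d_model + 2 * d_model * d_ff)
    # Attention scores: Q @ K^T and the score-weighted sum over V,
    # each costing num_tokens^2 * d_model MACs (summed over all heads).
    attention = 2 * num_tokens * num_tokens * d_model
    return {
        "linear": linear,
        "attention": attention,
        "attention_share": attention / (linear + attention),
    }


if __name__ == "__main__":
    # Hypothetical sizes: a short streaming chunk versus a long-context input.
    for tokens in (32, 8192):
        stats = layer_macs(tokens, d_model=512, d_ff=2048)
        print(tokens, f"{stats['attention_share']:.1%}")
```

Under these assumptions, attention scores account for about 1% of the layer's MACs at 32 tokens but roughly 73% at 8192 tokens. This is consistent with the claim that for short streaming inputs the linear layers, which also hold all of the layer's weights, are the part worth optimizing.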