Transformer-based models excel in speech recognition. Existing efforts to optimize Transformer inference, which typically target long-context applications, center on simplifying attention score calculations. However, streaming speech recognition models usually process only a limited number of tokens at a time, so attention score calculation is less of a bottleneck. Instead, the bottleneck lies in the linear projection layers of multi-head attention and the feedforward networks, which constitute a substantial portion of the model size and contribute significantly to computation, memory, and power usage. To address this bottleneck, we propose folding attention, a technique that targets these linear layers and significantly reduces model size while improving memory and power efficiency. Experiments on on-device Transformer-based streaming speech recognition models show that folding attention reduces model size (and the corresponding memory consumption) by up to 24% and power consumption by up to 23%, without compromising model accuracy or adding computation overhead.
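
The claim that linear layers, rather than attention scores, dominate when few tokens are processed can be checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the function name `layer_macs` and the sizes used (32 tokens, model width 512, feedforward width 2048) are assumptions chosen for the example, not values from the experiments, and it counts multiply-accumulates for a standard Transformer layer while ignoring biases, softmax, and any cached left context.

```python
def layer_macs(num_tokens: int, d_model: int, d_ff: int) -> dict:
    """Rough multiply-accumulate (MAC) counts for one standard Transformer layer.

    Hypothetical helper for illustration; sizes are assumptions, not paper values.
    """
    # Query, key, value, and output projections: four d_model x d_model matrices.
    attn_proj_params = 4 * d_model * d_model
    # Feedforward network: d_model x d_ff followed by d_ff x d_model.
    ffn_params = 2 * d_model * d_ff
    # Every token passes through every linear layer once.
    linear_macs = num_tokens * (attn_proj_params + ffn_params)
    # Attention scores (Q K^T) plus the weighted sum over V, summed over heads:
    # each costs num_tokens^2 * d_model MACs.
    score_macs = 2 * num_tokens * num_tokens * d_model
    return {
        "linear_params": attn_proj_params + ffn_params,
        "linear_macs": linear_macs,
        "score_macs": score_macs,
        "linear_to_score_ratio": linear_macs / score_macs,
    }


if __name__ == "__main__":
    # Short streaming chunk: linear layers dominate.
    print(layer_macs(num_tokens=32, d_model=512, d_ff=2048))
    # Long context: attention scores dominate instead.
    print(layer_macs(num_tokens=8192, d_model=512, d_ff=2048))
```

Under these assumed sizes, the linear layers cost about 96 times as many MACs as the attention scores for a 32-token chunk, whereas at 8192 tokens the ratio falls below 1. The ratio simplifies to (4·d_model + 2·d_ff) / (2·num_tokens), which is why shrinking the linear layers matters most in the short-chunk streaming regime.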