State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While local interactions can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention whose complexity is linear in the sequence length, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data. HyperConformer achieves a word error rate of 2.9% on LibriSpeech test-clean with fewer than 8M neural parameters and a peak memory during training of 5.7GB, making it trainable on accessible hardware. Its encoder is between 38% (on mid-length speech) and 56% (on long speech) faster than that of an equivalent Conformer. (The HyperConformer recipe is publicly available at https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transformer/)
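To illustrate why HyperMixer-style token mixing scales linearly with sequence length, the following is a minimal NumPy sketch of a single-head mixing step. It is a simplified illustration under stated assumptions, not the SpeechBrain recipe: the hypernetworks are reduced to two-layer per-token MLPs with random weights, and positional information, normalization, residual connections, and the multi-head split are omitted. The names `token_hypernetwork`, `hypermixer_token_mixing`, `init_params`, `mixing_multiply_adds`, and `d_hidden` are hypothetical and chosen only for this example.

```python
import numpy as np


def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def token_hypernetwork(x, a, b):
    """Map each token (row of x) independently to one row of a mixing matrix.

    x: (n_tokens, d_model), a: (d_model, d_model), b: (d_model, d_hidden).
    Returns an array of shape (n_tokens, d_hidden).
    """
    return gelu(x @ a) @ b


def hypermixer_token_mixing(x, params):
    """Simplified single-head token mixing with input-dependent weights."""
    w1 = token_hypernetwork(x, params["a1"], params["b1"])  # (n_tokens, d_hidden)
    w2 = token_hypernetwork(x, params["a2"], params["b2"])  # (n_tokens, d_hidden)
    # Pool information from all tokens into a fixed-size summary.
    summary = gelu(w1.T @ x)  # (d_hidden, d_model)
    # Redistribute the summary back to every token position.
    return w2 @ summary  # (n_tokens, d_model)


def init_params(d_model, d_hidden, seed=0):
    """Random hypernetwork weights (assumption: stand-ins for trained weights)."""
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    return {
        "a1": rng.normal(0.0, scale, (d_model, d_model)),
        "b1": rng.normal(0.0, scale, (d_model, d_hidden)),
        "a2": rng.normal(0.0, scale, (d_model, d_model)),
        "b2": rng.normal(0.0, scale, (d_model, d_hidden)),
    }


def mixing_multiply_adds(n_tokens, d_model, d_hidden):
    """Rough multiply-add counts for the token-mixing products only."""
    attention = 2 * n_tokens * n_tokens * d_model   # Q K^T, then scores @ V
    hypermixer = 2 * n_tokens * d_model * d_hidden  # W1^T X, then W2 @ summary
    return attention, hypermixer


params = init_params(d_model=16, d_hidden=8)
x = np.random.default_rng(1).normal(size=(50, 16))
y = hypermixer_token_mixing(x, params)
print(y.shape)  # (50, 16)
```

Because no matrix of size `n_tokens × n_tokens` is ever formed, the cost of the two mixing products grows in proportion to the number of tokens, whereas the two attention products grow quadratically. This rough count ignores the projections and hypernetworks themselves, which are linear in the sequence length in both cases.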