Rotary Position Embedding (RoPE) encodes both relative and absolute positional information in Transformer-based models by applying rotation matrices to the input vectors of a sequence. While RoPE has demonstrated superior performance compared to other positional embedding methods in natural language processing tasks, its effectiveness in speech processing applications remains understudied. In this work, we conduct a comprehensive evaluation of RoPE across diverse automatic speech recognition (ASR) tasks. Our experimental results demonstrate that for ASR tasks, RoPE consistently achieves lower error rates than the widely used relative positional embedding. To facilitate further research, we release the implementation and all experimental recipes through the SpeechBrain toolkit.
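To make the mechanism described above concrete, the following is a minimal NumPy sketch (not the paper's SpeechBrain implementation) of how RoPE rotates consecutive coordinate pairs of a vector by position-dependent angles, and of the resulting relative-position property: the inner product of a rotated query and key depends only on the offset between their positions. The function name `rope` and the frequency base `10000` follow the original RoPE formulation and are assumptions here, not part of this abstract.

```python
import numpy as np

def rope(x, pos, base=10000):
    """Apply Rotary Position Embedding to vector x at position pos.

    x must have even dimension d; each pair (x[2i], x[2i+1]) is rotated
    by the angle pos * base**(-2i/d).
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-2.0 * np.arange(half) / d)  # per-pair rotation frequencies
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Relative-position property: <rope(q, m), rope(k, n)> depends only on m - n.
s1 = rope(q, 3) @ rope(k, 1)    # positions (3, 1), offset 2
s2 = rope(q, 10) @ rope(k, 8)   # positions (10, 8), same offset 2
print(np.isclose(s1, s2))  # True
```

Because each pair is rotated by an orthogonal 2-D rotation, RoPE also preserves vector norms, so it can be applied to queries and keys without rescaling attention logits.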