The Transformer architecture has proven highly effective for automatic speech recognition (ASR) and has become a foundational component of much research in the domain. Many prior approaches rely on fixed-length attention windows, which become problematic for speech samples that vary in duration and complexity, leading to over-smoothing of the data and neglect of essential long-term dependencies. To address this limitation, we introduce Echo-MSA, a nimble module equipped with a variable-length attention mechanism that accommodates a range of speech sample complexities and durations. The module can extract speech features at multiple granularities, from frames and phonemes to words and discourse. The proposed design captures the variable-length nature of speech and addresses the limitations of fixed-length attention. Our evaluation uses a parallel attention architecture complemented by a dynamic gating mechanism that combines traditional attention with the Echo-MSA module output. Empirical results show that integrating Echo-MSA into the primary model's training regime significantly improves word error rate (WER) while preserving the intrinsic stability of the original model.
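To make the parallel design concrete, the following is a minimal sketch of gated fusion between full attention and variable-length windowed attention. It is an illustration under stated assumptions, not the Echo-MSA implementation: the abstract does not specify how windows are chosen or how the gate is computed, so the function names (`attention`, `gated_fusion`), the window sizes, the identity query/key/value projections, and the per-frame sigmoid gate are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x, window=None):
    """Single-head self-attention over frames x of shape (T, d).

    Assumption: queries, keys and values are x itself (no learned
    projections), to keep the sketch short. If `window` is given, each
    frame attends only to frames at most `window` steps away.
    """
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    if window is not None:
        idx = np.arange(T)
        too_far = np.abs(idx[:, None] - idx[None, :]) > window
        scores = np.where(too_far, -np.inf, scores)
    return softmax(scores) @ x


def gated_fusion(x, windows, w_gate, b_gate):
    """Blend full attention with the average of several windowed attentions.

    Assumption: the gate is a per-frame sigmoid of a linear map of the
    input; small windows stand in for frame/phoneme-level context and
    large ones for word/discourse-level context.
    """
    full = attention(x)
    multi_scale = np.mean([attention(x, w) for w in windows], axis=0)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))  # shape (T, 1)
    return gate * full + (1.0 - gate) * multi_scale


rng = np.random.default_rng(0)
frames = rng.normal(size=(12, 8))          # 12 frames, 8-dim features
w_gate = rng.normal(size=(8, 1)) * 0.1     # hypothetical gate weights
fused = gated_fusion(frames, windows=(1, 3, 6), w_gate=w_gate, b_gate=0.0)
print(fused.shape)
```

In this sketch the gate lets each frame decide how much to rely on the unrestricted attention path versus the multi-scale windowed path, which is one plausible reading of "dynamic gating" in the abstract.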