Self-supervised learning (SSL) has advanced speech processing, but self-attention gives SSL models quadratic complexity in sequence length. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes an entire utterance with mean pooling, but it lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which augments the global summary with local neighborhood summaries, preserving linear-time complexity while better capturing temporal dependencies. We also introduce a selective fine-tuning approach that replaces self-attention layers in SSL models with WSM blocks and fine-tunes only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage in the SSL models by 40\%. Selectively replacing attention layers with WSM blocks reduces compute, memory, and latency, making the approach well suited to low-resource speech recognition.
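The core idea can be illustrated with a minimal sketch. The snippet below is not the paper's implementation; the window size, the concatenation of summaries, and the function name are assumptions made for illustration. It shows how each frame receives both a global utterance summary (as in SummaryMixing) and a local neighborhood summary (the WSM addition), with cost linear in the number of frames for a fixed window:

```python
import numpy as np

def windowed_summary_mixing(x, window=4):
    """Illustrative sketch of Windowed SummaryMixing (hypothetical details).

    x: array of shape (time, dim) holding frame features.
    Each output frame concatenates the global mean summary of the whole
    utterance with the mean of a local window around that frame.
    """
    t, d = x.shape
    # Global summary: one mean over the utterance, as in SummaryMixing.
    global_summary = x.mean(axis=0)
    half = window // 2
    out = np.empty((t, 2 * d))
    for i in range(t):
        # Local summary: mean over the frame's neighborhood (the WSM addition).
        lo, hi = max(0, i - half), min(t, i + half + 1)
        local_summary = x[lo:hi].mean(axis=0)
        out[i] = np.concatenate([global_summary, local_summary])
    return out
```

Because the per-frame work is bounded by the fixed window size, the whole pass is O(time), unlike self-attention's O(time²) pairwise interactions.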