Transformer-based architectures dominate many deep learning fields, including Natural Language Processing, Computer Vision, and Speech Processing. This popularity may encourage the direct use of Transformers in constrained tasks, without questioning whether they yield the same benefits as in standard settings. Given specific constraints, it is essential to evaluate the relevance of Transformer models, and our paper critically examines the usefulness of the Transformer architecture in such constrained environments. We argue that the high computational requirements and latency of these models do not align well with streaming applications, and our study therefore promotes the search for alternative strategies that improve efficiency without sacrificing performance. As a first step, we show that the computational cost of Streaming Automatic Speech Recognition (ASR) can be reduced by replacing Self-Attention with deformable convolution. Furthermore, we show that the Self-Attention mechanism can be removed entirely, without any replacement, with no significant degradation in Word Error Rate.
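To make the core idea concrete: a deformable convolution lets each kernel tap sample the input at a learned, possibly fractional offset from its fixed grid position, rather than attending over the whole sequence. The sketch below is not the authors' implementation; it is a minimal 1-D NumPy illustration of the sampling mechanism, with the offsets passed in as if they had been predicted by a small learned layer.

```python
import numpy as np

def deformable_conv1d(x, weights, offsets):
    """Minimal 1-D deformable convolution (illustrative sketch only).

    x:       (T,)   input sequence
    weights: (K,)   kernel weights
    offsets: (T, K) fractional offsets, one per output position and tap
                    (assumed to come from a learned offset predictor)
    """
    T, K = offsets.shape
    center = (K - 1) / 2.0
    out = np.zeros(T)
    for t in range(T):
        for k in range(K):
            # Base grid position of tap k, shifted by its learned offset.
            p = t + (k - center) + offsets[t, k]
            p = min(max(p, 0.0), T - 1)  # clamp to sequence bounds
            lo, hi = int(np.floor(p)), int(np.ceil(p))
            frac = p - lo
            # Linear interpolation makes fractional positions differentiable.
            sample = (1 - frac) * x[lo] + frac * x[hi]
            out[t] += weights[k] * sample
    return out
```

With all offsets at zero this reduces to an ordinary convolution on a fixed grid; non-zero offsets let the kernel adapt its receptive field per position, which is what replaces the content-based routing of Self-Attention in this setting.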