The transformer is a widely used building block in modern neural networks. However, when applied to audio data, the transformer's acausal behaviour, which we term Acausal Attention (AA), has generally limited its application to offline tasks. In this paper we introduce Streaming Attention (SA), which operates causally with fixed latency and requires less compute and memory than AA to train. Next, we introduce Low Latency Streaming Attention (LLSA), a method which combines multiple SA layers without latency build-up proportional to the layer count. A comparative analysis of AA, SA and LLSA on Automatic Speech Recognition (ASR) and Speech Emotion Recognition (SER) tasks is presented. The results show that causal SA-based networks with fixed latencies of a few seconds (e.g. 1.8 seconds) and LLSA networks with latencies as short as 300 ms can perform comparably with acausal (AA) networks. We conclude that SA and LLSA methods retain many of the benefits of conventional acausal transformers, but with latency characteristics that make them practical to run in real-time streaming applications.
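The latency build-up mentioned above can be illustrated with a small sketch. It is not the SA or LLSA mechanism itself, whose details are not given here. It assumes a generic band-limited attention mask in which each frame may attend to a fixed number of past and future frames, and it traces how the lookahead of a plain stack of such layers grows with depth. The names `band_mask`, `compose` and `max_lookahead`, the window sizes, and the 20 ms frame step are all hypothetical choices made for illustration.

```python
def band_mask(num_frames, past, future):
    """mask[t][s] is True when output frame t may attend to input frame s.

    Assumption: a simple band-limited window, not the paper's SA definition.
    """
    return [[-past <= s - t <= future for s in range(num_frames)]
            for t in range(num_frames)]


def compose(upper, lower):
    """Boolean matrix product: dependencies of `upper` stacked on `lower`."""
    n = len(upper)
    return [[any(upper[t][k] and lower[k][s] for k in range(n))
             for s in range(n)]
            for t in range(n)]


def max_lookahead(mask):
    """Largest number of future frames that any output frame depends on."""
    return max(s - t
               for t, row in enumerate(mask)
               for s, allowed in enumerate(row) if allowed)


FRAME_MS = 20  # hypothetical frame step, chosen only for this example

# One layer that sees 8 past frames and 3 future frames.
layer = band_mask(num_frames=40, past=8, future=3)

# Stack the same layer 1 to 4 times and record the lookahead at each depth.
stack = layer
lookaheads = [max_lookahead(stack)]
for _ in range(3):
    stack = compose(layer, stack)
    lookaheads.append(max_lookahead(stack))

latencies_ms = [frames * FRAME_MS for frames in lookaheads]
print(lookaheads)    # lookahead in frames for depths 1..4
print(latencies_ms)  # the same lookahead expressed in milliseconds
```

Under these assumptions the lookahead grows linearly with the number of stacked layers, which is the kind of build-up that LLSA is stated to avoid. Setting `future=0` gives a strictly causal mask whose lookahead stays at zero regardless of depth.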