The transformer is a fundamental building block in deep learning, and the attention mechanism is the transformer's core component. Self-supervised speech representation learning (SSRL) is a popular use case for the transformer architecture. Because the transformer is inherently acausal, its use for SSRL has focused predominantly on acausal applications. However, several media processing problems, such as speech processing, require real-time solutions. In this paper, we present an implementation of the attention module that enables training of SSRL architectures with low compute and memory requirements, while allowing real-time inference with low and fixed latency. The attention module proposed in this paper comprises two components: streaming attention (SA) and low-latency streaming attention (LLSA). SA is our proposal for an efficient streaming SSRL implementation, while LLSA solves the latency build-up problem of other streaming attention architectures, such as masked acausal attention (MAA), guaranteeing a latency equal to that of a single layer even when multiple layers are stacked. We present a comparative analysis between vanilla attention, which we refer to here as acausal attention (AA), SA, and LLSA, by training a streaming SSRL model with automatic speech recognition as the downstream task. When training on librispeech-clean-100 and testing on librispeech-test-clean, our low-latency attention module achieves a word error rate (WER) of 5.84%, a significant improvement over the MAA (WER = 13.82%). Our implementation also reduces the inference latency from 1.92 to 0.16 seconds. The proposed low-latency module preserves many of the benefits of conventional acausal transformers, while enabling latency characteristics that make it applicable to real-time streaming applications.
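As context for the masking terminology used above, the following is a minimal, hypothetical sketch (not the authors' implementation) contrasting a fully acausal attention mask with a block-wise streaming mask in which each frame attends to all past frames plus a small look-ahead block, so that latency is bounded by the block size. The function names, block size, and tensor shapes are illustrative assumptions only.

```python
# Illustrative sketch only: block-wise streaming mask vs. fully acausal mask.
import torch

def acausal_mask(seq_len: int) -> torch.Tensor:
    """Every query position may attend to every key position (offline use)."""
    return torch.ones(seq_len, seq_len, dtype=torch.bool)

def streaming_mask(seq_len: int, block: int) -> torch.Tensor:
    """Each query attends to all previous frames plus the frames inside its
    own look-ahead block, so inference latency is bounded by the block size."""
    idx = torch.arange(seq_len)
    block_end = (idx // block + 1) * block          # end of each query's block
    return idx.unsqueeze(0) < block_end.unsqueeze(1)  # mask[i, j] = (j < block_end[i])

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted by a boolean visibility mask."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 8 frames, hypothetical look-ahead block of 2 frames.
q = k = v = torch.randn(8, 16)
out_streaming = masked_attention(q, k, v, streaming_mask(8, block=2))
out_acausal = masked_attention(q, k, v, acausal_mask(8))
```

Note that when several such masked layers are stacked naively, each layer's look-ahead compounds, which is the latency build-up problem the paper's LLSA component is described as addressing.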