Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition

Recently, the Conformer has achieved state-of-the-art performance as a backbone network for end-to-end automatic speech recognition. The Conformer block combines a self-attention mechanism, which captures global information, with a convolutional neural network, which captures local information. However, the computational complexity of self-attention grows quadratically with the length of the input sequence. Inspired by previous work on Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance into the downsampling procedure of the Conformer encoder. We define a frame with a non-blank output as a key frame. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method that uses key frames to reduce the computation of self-attention. Our proposed architecture comprises two encoders. After the first encoder, we apply an intermediate CTC loss function to obtain a label for each frame, which lets us extract the key frames and blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism, which operates directly on high-dimensional acoustic features and drops the frames corresponding to blank labels, yielding a new acoustic feature sequence as input to the second encoder. The proposed method achieves comparable or better performance than the vanilla Conformer and similar work such as Efficient Conformer. Meanwhile, it can discard more than 60% of the frames, which it treats as uninformative, during model training and inference, which significantly accelerates inference. The code for this work is available at https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer
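
The frame-dropping idea behind key frame-based downsampling can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the intermediate CTC output is available as per-frame posteriors, that each frame is labeled by a greedy argmax, and that index 0 is the blank label. The names `frame_labels`, `key_frame_indices`, and `drop_blank_frames` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: index 0 is the CTC blank label


def frame_labels(posteriors: Sequence[Sequence[float]]) -> List[int]:
    """Greedy per-frame label: index of the largest posterior in each frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in posteriors]


def key_frame_indices(labels: Sequence[int], blank_id: int = BLANK_ID) -> List[int]:
    """Indices of key frames, i.e. frames whose label is not the blank."""
    return [t for t, label in enumerate(labels) if label != blank_id]


def drop_blank_frames(
    features: Sequence[Sequence[float]],
    posteriors: Sequence[Sequence[float]],
    blank_id: int = BLANK_ID,
) -> List[Sequence[float]]:
    """Keep only the feature vectors at key-frame positions.

    `features` and `posteriors` must have one entry per frame.
    """
    if len(features) != len(posteriors):
        raise ValueError("features and posteriors must cover the same frames")
    keep = key_frame_indices(frame_labels(posteriors), blank_id)
    return [features[t] for t in keep]


# Toy example: 6 frames, vocabulary of 3 labels (blank, "a", "b").
toy_posteriors = [
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.80, 0.10],  # "a"
    [0.70, 0.20, 0.10],  # blank
    [0.60, 0.30, 0.10],  # blank
    [0.20, 0.10, 0.70],  # "b"
    [0.80, 0.10, 0.10],  # blank
]
# One 2-dimensional feature vector per frame (values are arbitrary).
toy_features = [[float(t), float(t) + 0.5] for t in range(6)]

shortened = drop_blank_frames(toy_features, toy_posteriors)
```

In this toy input, only the two non-blank frames survive, so the sequence passed to a second encoder would be one third of the original length. Because self-attention cost grows quadratically with sequence length, shortening the sequence in this way reduces attention cost by more than the fraction of frames removed.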
