Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition

Recently, Conformer has achieved state-of-the-art performance as a backbone network for end-to-end automatic speech recognition. The Conformer block combines a self-attention mechanism, which captures global information, with a convolutional neural network, which captures local information. However, the computational complexity of self-attention grows quadratically with the length of the input sequence. Inspired by previous work on Connectionist Temporal Classification (CTC) guided blank skipping during decoding, we introduce intermediate CTC outputs as guidance into the downsampling procedure of the Conformer encoder. We define a frame with a non-blank output as a key frame. Specifically, we introduce the key frame-based self-attention (KFSA) mechanism, a novel method that reduces the computation of self-attention by using key frames. Our proposed architecture comprises two encoders. After the first encoder, we introduce an intermediate CTC loss function to compute the label of each frame, which lets us extract the key frames and blank frames for KFSA. Furthermore, we introduce the key frame-based downsampling (KFDS) mechanism, which operates directly on high-dimensional acoustic features and drops the frames corresponding to blank labels, producing a new acoustic feature sequence as input to the second encoder. The proposed method achieves comparable or higher performance than the vanilla Conformer and other similar work such as Efficient Conformer. Meanwhile, it can discard more than 60% of useless frames during model training and inference, which significantly accelerates inference. The code for this work is available at https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer
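The frame-dropping idea behind KFDS can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes greedy (argmax) decoding of the intermediate CTC posteriors, a blank label at index 0, and plain Python lists in place of tensors. The names `key_frame_indices`, `key_frame_downsample`, and `attention_cost_ratio` are hypothetical and chosen only for this illustration.

```python
from typing import List, Sequence

BLANK_ID = 0  # assumption: the CTC blank label sits at index 0


def key_frame_indices(
    ctc_posteriors: Sequence[Sequence[float]], blank_id: int = BLANK_ID
) -> List[int]:
    """Return indices of frames whose greedy (argmax) CTC label is not blank."""
    keep = []
    for t, frame in enumerate(ctc_posteriors):
        # Greedy label for this frame: index of the largest posterior.
        label = max(range(len(frame)), key=frame.__getitem__)
        if label != blank_id:
            keep.append(t)
    return keep


def key_frame_downsample(
    features: Sequence[Sequence[float]],
    ctc_posteriors: Sequence[Sequence[float]],
    blank_id: int = BLANK_ID,
) -> List[Sequence[float]]:
    """Drop the feature frames that the intermediate CTC output labels as blank."""
    if len(features) != len(ctc_posteriors):
        raise ValueError("features and ctc_posteriors must have the same length")
    keep = key_frame_indices(ctc_posteriors, blank_id)
    return [features[t] for t in keep]


def attention_cost_ratio(kept_frames: int, total_frames: int) -> float:
    """Relative cost of full self-attention after dropping frames.

    Self-attention cost grows quadratically with sequence length, so keeping
    a fraction r of the frames leaves roughly r**2 of the attention cost.
    """
    return (kept_frames / total_frames) ** 2
```

Under these assumptions, discarding 60% of the frames would leave 40% of the sequence, so the self-attention cost in the second encoder would fall to roughly 0.4² = 0.16 of its original value. This estimate covers the attention term only, not the convolution or feed-forward modules, and not the first encoder.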
