In this paper, an Enhanced Self-Attention (ESA) mechanism is proposed for robust feature extraction. The proposed ESA combines recursive gated convolution with the self-attention mechanism: the former captures multi-order feature interactions, and the latter extracts global features. Where the ESA is best inserted is also worth exploring. Here, the ESA is embedded into the encoder layer of the Transformer network for automatic speech recognition (ASR), and the resulting model is named GNCformer. The effectiveness of GNCformer is validated on two datasets, Aishell-1 and HKUST. Experimental results show that, compared with the Transformer network, GNCformer improves the character error rate (CER) by 0.8% and 1.2% on these two datasets, respectively. Notably, GNCformer adds only 1.4M parameters.
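For readers unfamiliar with the metric, the following is a minimal sketch of how a character error rate is commonly computed: the Levenshtein (edit) distance between the reference and hypothesis character sequences, divided by the reference length. The function names `edit_distance` and `cer` are illustrative assumptions and do not come from the paper, and the abstract does not state whether the reported improvements are absolute or relative, so the sketch makes no claim about that.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, deletions, and
    insertions needed to turn `ref` into `hyp`.
    """
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent (hypothetical helper)."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of four reference characters.
    print(cer("abcd", "abxd"))
```

Under this definition, a hypothesis with one wrong character against a four-character reference has a CER of 25%, and corpus-level CER is usually obtained by summing edit distances and reference lengths over all utterances before dividing.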