FusionFormer: Fusing Operations in Transformer for Efficient Streaming Speech Recognition

The recently proposed Conformer architecture, which combines convolution with attention to capture both local and global dependencies, has become the \textit{de facto} backbone model for Automatic Speech Recognition~(ASR). Inherited from Natural Language Processing~(NLP) tasks, the architecture uses Layer Normalization~(LN) as its default normalization technique. However, through a series of systematic studies, we find that LN can take 10\% of the inference time even though it contributes only 0.1\% of the FLOPs. This motivates us to replace LN with other normalization techniques, e.g., Batch Normalization~(BN), which speeds up inference through operator fusion and avoids computing the mean and variance statistics at inference time. After examining several plain attempts that directly remove all LN layers or replace them with BN in the same place, we find that the resulting divergence issue is mainly caused by unstable layer outputs. We therefore propose to append a BN layer to each linear or convolution layer, with which we observe stable training. We also propose to simplify the activations in Conformer, such as Swish and GLU, by replacing them with ReLU. All of these replacement modules can be fused into the weights of the adjacent linear/convolution layers and hence have zero inference cost. We therefore name the model FusionFormer. Our experiments indicate that FusionFormer is as effective as the LN-based Conformer and is about 10\% faster.
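The fusion of a BN layer into the preceding linear layer can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `fold_bn_into_linear`, the tensor shapes, and the `eps` value are assumptions chosen for illustration. It relies only on the standard fact that inference-mode BN applies a fixed per-channel affine map built from stored running statistics, so it can be absorbed into the weights and bias. LN, by contrast, computes its mean and variance from each input at inference time and cannot be folded this way.

```python
import numpy as np

def fold_bn_into_linear(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold an inference-mode BatchNorm that follows y = x @ W.T + b into (W, b).

    W: (d_out, d_in), b: (d_out,). gamma, beta, mean, var: (d_out,) BN parameters
    and running statistics. All names here are hypothetical, for illustration only.
    """
    scale = gamma / np.sqrt(var + eps)   # fixed per-output-channel scale
    W_fused = W * scale[:, None]         # scale each output row of W
    b_fused = (b - mean) * scale + beta  # shift absorbed into the bias
    return W_fused, b_fused

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
W = rng.normal(size=(d_out, d_in))
b = rng.normal(size=d_out)
gamma = rng.normal(size=d_out)
beta = rng.normal(size=d_out)
mean = rng.normal(size=d_out)
var = rng.uniform(0.5, 2.0, size=d_out)
x = rng.normal(size=(3, d_in))

# Reference path: linear layer followed by BN using running statistics.
y = x @ W.T + b
y_ref = gamma * (y - mean) / np.sqrt(var + 1e-5) + beta

# Fused path: a single linear layer with folded weights and bias.
W_f, b_f = fold_bn_into_linear(W, b, gamma, beta, mean, var)
y_fused = x @ W_f.T + b_f

max_abs_diff = np.max(np.abs(y_ref - y_fused))
```

Under these assumptions the two paths agree up to floating-point rounding, so the BN layer adds no separate operation at inference time.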
