This paper proposes a method for effective joint training and pruning based on adaptive dropout layers with unit-wise retention probabilities. The proposed method estimates a unit-wise retention probability in each dropout layer; a unit estimated to have a small retention probability can be considered prunable. The retention probability of each unit is estimated using back-propagation and the Gumbel-Softmax technique. The pruning method is applied at several points in the Conformer architecture so that the effective number of parameters can be significantly reduced. Specifically, adaptive dropout layers are introduced at three locations in each Conformer block: (a) the hidden layer of the feed-forward network component, (b) the query vectors and the value vectors of the self-attention component, and (c) the input vectors of the LConv component. The proposed method is evaluated in a speech recognition experiment on the LibriSpeech task. The results show that the approach can simultaneously reduce the number of parameters and improve accuracy: word error rates improved by approximately 1% while the number of parameters was reduced by 54%.
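As a minimal sketch of the general idea, the per-unit keep/drop decision can be relaxed with a Gumbel-Softmax (binary-concrete) mask so that the retention probabilities become differentiable and trainable by back-propagation; after training, units whose retention probability falls below a threshold are pruned. All names, the temperature, and the 0.5 threshold below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel(shape):
    # Standard Gumbel noise via the inverse-transform method.
    u = rng.uniform(1e-9, 1.0, size=shape)
    return -np.log(-np.log(u))

def relaxed_keep_mask(logits, tau=0.5):
    # Binary-concrete (two-class Gumbel-Softmax) relaxation of a
    # per-unit keep/drop Bernoulli variable; differentiable in logits.
    g_keep, g_drop = gumbel(logits.shape), gumbel(logits.shape)
    return 1.0 / (1.0 + np.exp(-(logits + g_keep - g_drop) / tau))

# Hypothetical layer with 8 units; logits parameterize retention probs.
logits = np.array([-3.0, -2.5, 0.5, 1.0, 2.0, -4.0, 3.0, 0.0])
retention = 1.0 / (1.0 + np.exp(-logits))  # sigmoid of the logits

# During training: scale activations by the soft, sampled mask.
x = rng.standard_normal(8)
x_masked = x * relaxed_keep_mask(logits)

# After training: prune units with small estimated retention probability.
keep = retention >= 0.5
print(f"kept {keep.sum()} of {keep.size} units")
```

In a real system the logits would be trained jointly with the network weights, and an analogous mask would sit at each of the three application points in the Conformer block.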