Transformer-based end-to-end speech recognition has achieved great success. However, the large footprint and computational overhead of these models make them difficult to deploy in some real-world applications. Model compression techniques can reduce model size and speed up inference, but the compressed model has a fixed architecture, which may be suboptimal. We propose a novel Transformer encoder with Input-Dependent Dynamic Depth (I3D) to achieve strong performance-efficiency trade-offs. With a similar number of layers at inference time, I3D-based models outperform both the vanilla Transformer and a statically pruned model obtained via iterative layer pruning. We also present an analysis of the gate probabilities and their input dependency, which helps us better understand deep encoders.
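To make the idea of input-dependent depth concrete, the following is a minimal sketch and not the I3D implementation. The abstract only states that gates with probabilities select layers depending on the input; everything else here is an assumption made for illustration: the toy residual "layers", the logistic gate predictor, its hand-picked weights, and the 0.5 threshold. All names (`make_toy_layer`, `gate_probabilities`, `dynamic_depth_encode`, `GATE_WEIGHTS`) are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_toy_layer(scale: float) -> Layer:
    """Stand-in for an encoder layer (assumption): residual update x + scale * tanh(x)."""
    def layer(x: Vector) -> Vector:
        return [v + scale * math.tanh(v) for v in x]
    return layer


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gate_probabilities(
    x: Vector, weights: Sequence[Vector], biases: Sequence[float]
) -> List[float]:
    """Hypothetical gate predictor: one logistic unit per layer, fed the input features."""
    return [
        sigmoid(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
        for w, b in zip(weights, biases)
    ]


def dynamic_depth_encode(
    x: Vector,
    layers: Sequence[Layer],
    weights: Sequence[Vector],
    biases: Sequence[float],
    threshold: float = 0.5,
) -> Tuple[Vector, List[int], List[float]]:
    """Run only the layers whose gate probability reaches the threshold.

    A skipped layer acts as the identity, so the effective depth
    varies from one input to the next.
    """
    probs = gate_probabilities(x, weights, biases)
    executed: List[int] = []
    for index, (layer, prob) in enumerate(zip(layers, probs)):
        if prob >= threshold:
            x = layer(x)
            executed.append(index)
    return x, executed, probs


# Illustrative configuration (assumed values, not from the paper).
LAYERS = [make_toy_layer(0.1) for _ in range(4)]
GATE_WEIGHTS = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
GATE_BIASES = [0.0, 0.0, 0.0, 0.0]

if __name__ == "__main__":
    for features in ([2.0, -1.0], [-2.0, 1.0]):
        _, used, probs = dynamic_depth_encode(
            features, LAYERS, GATE_WEIGHTS, GATE_BIASES
        )
        print(features, "-> layers executed:", used,
              "gate probs:", [round(p, 2) for p in probs])
```

In this toy setup both inputs run two of the four layers, but different ones, which is the contrast with a statically pruned model: pruning fixes one subset of layers for every input, while a gated encoder can pick a subset per input. How the gates are trained and how the layer budget is controlled are not described in the abstract and are left out of the sketch.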