With the advances in deep learning, the performance of end-to-end (E2E) single-task models for speech and audio processing has been improving steadily. However, building a general-purpose model that performs well on multiple tasks remains challenging, since different speech and audio processing tasks usually require different training data, input features, or model architectures to achieve optimal performance. In this work, MT2KD, a novel two-stage multi-task learning framework, is proposed to build a general-purpose speech and audio encoder that jointly performs three fundamental tasks: automatic speech recognition (ASR), audio tagging (AT), and speaker verification (SV). In the first stage, multi-teacher knowledge distillation (KD) is applied to align the feature spaces of three high-performance single-task teacher encoders within a single student encoder using the same unlabelled data. In the second stage, multi-task supervised fine-tuning is carried out by initialising the model from the first stage and training on the separate labelled data of each task. Experiments demonstrate that the proposed multi-task training pipeline significantly outperforms a baseline model trained with multi-task learning from scratch. With only 66M total model parameters, the final system achieves good performance on ASR, AT, and SV: less than a 4% relative word-error-rate increase on ASR, a mean average precision only 1.9 points lower on AT, and a 0.23% absolute higher equal error rate on SV, compared to the best-performing single-task encoders.
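The abstract does not spell out the stage-1 objective; as an illustrative sketch only, one common way to align a student's feature space to several frozen teachers is a summed MSE between per-teacher linear projections of the student features and each teacher's features on the same unlabelled audio. The function name, the projection matrices, and the per-teacher weighting below are all hypothetical, not taken from the paper.

```python
import numpy as np

def multi_teacher_kd_loss(student_feats, teacher_feats, projections, weights=None):
    """Hypothetical stage-1 multi-teacher KD objective.

    student_feats: (T, d_s) student encoder outputs for one utterance.
    teacher_feats: dict mapping teacher name -> (T, d_t) frozen teacher outputs.
    projections:   dict mapping teacher name -> (d_s, d_t) linear map that
                   projects student features into that teacher's space.
    weights:       optional dict of per-teacher loss weights (default 1.0).
    """
    total = 0.0
    for name, t_feat in teacher_feats.items():
        w = 1.0 if weights is None else weights[name]
        projected = student_feats @ projections[name]  # map into teacher space
        total += w * np.mean((projected - t_feat) ** 2)  # MSE alignment term
    return total
```

In this sketch the teachers (e.g. one each for ASR, AT, and SV) would stay frozen while the student and the small projection heads are trained, so the student is pushed towards all three feature spaces at once before the supervised stage-2 fine-tuning.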