In this paper, we propose a bilevel joint unsupervised and supervised training (BL-JUST) framework for automatic speech recognition. In contrast to the conventional pre-training and fine-tuning strategy, which is a disconnected two-stage process, BL-JUST optimizes an acoustic model so that it simultaneously minimizes both the unsupervised and supervised loss functions. Because BL-JUST seeks matched local optima of the two loss functions, the acoustic representations learned by the model strike a good balance between being generic and being task-specific. We solve the BL-JUST problem using penalty-based bilevel gradient descent and evaluate the trained deep neural network acoustic models on various datasets with a variety of architectures and loss functions. We show that BL-JUST outperforms the widely used pre-training and fine-tuning strategy as well as several other popular semi-supervised techniques.
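The penalty-based bilevel gradient descent mentioned above can be illustrated with a minimal sketch: relax the lower-level constraint (minimizing the unsupervised loss) into a penalty term added to the supervised loss, then run gradient descent while gradually increasing the penalty weight. The toy scalar losses, step sizes, and schedule below are all illustrative assumptions, not the paper's actual objectives or hyperparameters.

```python
# Sketch of penalty-based bilevel gradient descent on toy scalar losses.
# sup_loss plays the role of the supervised objective (e.g. CTC/CE) and
# unsup_loss the unsupervised one; both are hypothetical stand-ins.
def sup_loss(theta):   return (theta - 2.0) ** 2
def unsup_loss(theta): return (theta - 1.0) ** 2
def sup_grad(theta):   return 2.0 * (theta - 2.0)
def unsup_grad(theta): return 2.0 * (theta - 1.0)

def penalty_bilevel_gd(theta=0.0, lam=0.1, lam_growth=2.0,
                       rounds=8, steps=200, lr=0.01):
    """Minimize sup_loss + lam * unsup_loss, growing lam each round.

    As lam grows, the penalty increasingly enforces the lower-level
    condition (theta near a minimizer of unsup_loss), approximating
    the bilevel solution.
    """
    for _ in range(rounds):
        for _ in range(steps):
            # Single gradient step on the penalized joint objective.
            theta -= lr * (sup_grad(theta) + lam * unsup_grad(theta))
        lam *= lam_growth  # tighten the penalty between rounds
    return theta

theta_star = penalty_bilevel_gd()
```

For these quadratics the penalized minimizer is (2 + lam) / (1 + lam), which drifts toward the unsupervised optimum at 1.0 as lam grows; in practice theta would be the network parameters and the gradients would come from backpropagation over both losses.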