Speech models have long been known to overfit individual speakers in many classification tasks. This leads to poor generalization when speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and investigate transfer learning approaches inspired by the recent success of pre-trained models on natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate the pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.