Speech models have long been known to overfit to individual speakers in many classification tasks. This leads to poor generalization when speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and investigate transfer learning approaches inspired by the recent success of pre-trained models on natural language tasks. Specifically, we propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate the resulting models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
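To make "every permutation of four corpora" concrete, the following is a minimal sketch of how the pre-finetuning schedules might be enumerated. The abstract does not name the four corpora, so the names in `CORPORA` are hypothetical placeholders, as is the helper `pre_finetuning_orders`. The abstract also does not say whether a permutation always uses all four corpora or may be an ordered subset, so the sketch counts both readings rather than assuming one.

```python
from itertools import permutations

# Hypothetical placeholder names: the abstract does not name the four corpora.
CORPORA = ["corpus_a", "corpus_b", "corpus_c", "corpus_d"]


def pre_finetuning_orders(corpora, include_subsets=True):
    """Return ordered pre-finetuning schedules as tuples of corpus names.

    include_subsets=False: only orderings that use every corpus.
    include_subsets=True:  orderings of every non-empty subset as well.
    Which reading the authors intend is an assumption, not stated in the abstract.
    """
    if include_subsets:
        lengths = range(1, len(corpora) + 1)
    else:
        lengths = [len(corpora)]
    return [order for k in lengths for order in permutations(corpora, k)]


full_orders = pre_finetuning_orders(CORPORA, include_subsets=False)
all_orders = pre_finetuning_orders(CORPORA, include_subsets=True)

print(len(full_orders))  # 4! = 24 orderings of all four corpora
print(len(all_orders))   # 4 + 12 + 24 + 24 = 64 ordered non-empty subsets
```

Each tuple would correspond to one sequence of pre-finetuning stages applied before the few-shot fine-tuning trials. How the 33,600 trials are distributed across these schedules is not specified in the abstract and is not assumed here.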