Speech models have long been known to overfit to individual speakers across many classification tasks. This leads to poor generalization when speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and investigate transfer learning approaches inspired by the recent success of pre-trained models on natural language tasks. Specifically, we propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate our pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
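As a minimal sketch of the two-stage pipeline described above, the snippet below pre-finetunes Wav2Vec2.0 as an audio classifier on pooled source emotion corpora, then re-loads the checkpoint with a fresh head for few-shot fine-tuning on the target dataset. It assumes the Hugging Face `transformers` implementation; the checkpoint names, label counts, and training-loop details are illustrative assumptions, not the authors' code.

```python
# Sketch of pre-finetuning Wav2Vec2.0, then few-shot fine-tuning.
# Assumes Hugging Face transformers; hyperparameters are placeholders.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Stage 1: pre-finetune on one permutation of the source emotion corpora.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base",
    num_labels=4,  # number of shared emotion classes (assumption)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(waveform, label):
    """One supervised step on a 16 kHz mono waveform (array of floats)."""
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    loss = model(**inputs, labels=torch.tensor([label])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# ... iterate train_step over the pooled source corpora, then save:
model.save_pretrained("w2v2-prefinetuned")

# Stage 2: few-shot fine-tune on the target dataset (e.g. ESD), re-loading
# the pre-finetuned encoder with a freshly initialized classification head.
few_shot_model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "w2v2-prefinetuned",
    num_labels=5,                   # ESD's five emotion classes (assumption)
    ignore_mismatched_sizes=True,   # allow the new head to differ in shape
)
```

Re-initializing the head while keeping the pre-finetuned encoder is one common transfer recipe; it matches the idea of distilling knowledge from the difficult source tasks into the few-shot downstream objective.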