Data-intensive fine-tuning of speech foundation models (SFMs) on scarce and diverse dysarthric and elderly speech leads to data bias and poor generalization to unseen speakers. This paper proposes novel structured speaker-deficiency adaptation approaches for SSL pre-trained SFMs on such data. Speaker- and speech-deficiency-invariant SFMs were constructed in the supervised adaptive fine-tuning stage to reduce undue bias towards training-data speakers and to serve as a more neutral and robust starting point for test-time unsupervised adaptation. Speech variability attributed to speaker identity and to speech impairment severity, or aging-induced neurocognitive decline, is modelled using separate adapters that can be combined to model any seen or unseen speaker. Experiments on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that structured speaker-deficiency adaptation of HuBERT and Wav2vec2-conformer models consistently outperforms baseline SFMs using either: a) no adapters; b) global adapters shared among all speakers; or c) single-attribute adapters modelling speaker or deficiency labels alone, with statistically significant WER reductions of up to 3.01% and 1.50% absolute (10.86% and 6.94% relative) on the two tasks respectively. The lowest published WER of 19.45% (49.34% on very low intelligibility, 33.17% on unseen words) is obtained on the UASpeech test set of 16 dysarthric speakers.
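The core idea above, separate adapters for speaker identity and for speech deficiency (impairment severity or neurocognitive decline) whose residuals are composed on top of a shared SFM, can be illustrated with a minimal sketch. All names here (`BottleneckAdapter`, `adapt`, the additive composition of the two residual branches) are illustrative assumptions; the paper's actual adapter architecture and fusion mechanism may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BOTTLENECK = 8, 2  # toy hidden size and adapter bottleneck width

class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter: down-project, ReLU, up-project."""
    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(scale=0.1, size=(dim, bottleneck))
        self.w_up = rng.normal(scale=0.1, size=(bottleneck, dim))

    def __call__(self, h):
        # Returns only the residual branch; it is added to the hidden state later.
        return np.maximum(h @ self.w_down, 0.0) @ self.w_up

# One adapter per attribute: speaker identity and deficiency severity.
speaker_adapter = BottleneckAdapter(DIM, BOTTLENECK, rng)
severity_adapter = BottleneckAdapter(DIM, BOTTLENECK, rng)

def adapt(h, spk, sev):
    # Assumed composition: sum both attribute-specific residuals onto the
    # frozen SFM hidden state. For an unseen speaker, a severity adapter
    # matching the speaker's impairment level could be paired with a
    # neutral or estimated speaker adapter.
    return h + spk(h) + sev(h)

h = rng.normal(size=(5, DIM))  # 5 frames of SFM hidden features
out = adapt(h, speaker_adapter, severity_adapter)
```

The separation means each adapter only has to capture one source of variability, so attribute combinations never seen together in training can still be composed at test time.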