Data-intensive fine-tuning of speech foundation models (SFMs) on scarce and diverse dysarthric and elderly speech leads to data bias and poor generalization to unseen speakers. This paper proposes novel structured speaker-deficiency adaptation approaches for SSL pre-trained SFMs on such data. Speaker- and speech-deficiency-invariant SFMs were constructed in the supervised adaptive fine-tuning stage to reduce undue bias towards training-data speakers and to serve as a more neutral and robust starting point for test-time unsupervised adaptation. Speech variability attributed to speaker identity and to speech impairment severity, or aging-induced neurocognitive decline, is modelled using separate adapters that can be combined to model any seen or unseen speaker. Experiments on the UASpeech dysarthric and DementiaBank Pitt elderly speech corpora suggest that structured speaker-deficiency adaptation of HuBERT and Wav2vec2-conformer models consistently outperforms baseline SFMs using either: a) no adapters; b) global adapters shared among all speakers; or c) single-attribute adapters modelling speaker or deficiency labels alone, with statistically significant WER reductions of up to 3.01% and 1.50% absolute (10.86% and 6.94% relative) on the two tasks respectively. The lowest published WER of 19.45% (49.34% on very low intelligibility, 33.17% on unseen words) is obtained on the UASpeech test set of 16 dysarthric speakers.
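The core idea above, separate adapters for speaker identity and for speech deficiency (impairment severity or neurocognitive decline) whose residuals are composed on top of a shared SFM, can be illustrated with a minimal sketch. All names here (`BottleneckAdapter`, `adapt`, the additive composition of the two residual branches) are illustrative assumptions; the paper's actual adapter architecture and fusion mechanism may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, BOTTLENECK = 8, 2  # toy hidden size and adapter bottleneck width

class BottleneckAdapter:
    """Hypothetical residual bottleneck adapter: down-project, ReLU, up-project."""
    def __init__(self, dim, bottleneck, rng):
        self.w_down = rng.normal(scale=0.1, size=(dim, bottleneck))
        self.w_up = rng.normal(scale=0.1, size=(bottleneck, dim))

    def __call__(self, h):
        # Returns only the residual branch; it is added to the hidden state later.
        return np.maximum(h @ self.w_down, 0.0) @ self.w_up

# One adapter per attribute: speaker identity and deficiency severity.
speaker_adapter = BottleneckAdapter(DIM, BOTTLENECK, rng)
severity_adapter = BottleneckAdapter(DIM, BOTTLENECK, rng)

def adapt(h, spk, sev):
    # Assumed composition: sum both attribute-specific residuals onto the
    # frozen SFM hidden state. For an unseen speaker, a severity adapter
    # matching the speaker's impairment level could be paired with a
    # neutral or estimated speaker adapter.
    return h + spk(h) + sev(h)

h = rng.normal(size=(5, DIM))  # 5 frames of SFM hidden features
out = adapt(h, speaker_adapter, severity_adapter)
```

The separation means each adapter only has to capture one source of variability, so attribute combinations never seen together in training can still be composed at test time.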