Dysarthric speech exhibits high variability and limited labeled data, posing major challenges for both automatic speech recognition (ASR) and assistive speech technologies. Existing approaches rely on synthetic data augmentation or speech reconstruction, yet often entangle speaker identity with pathological articulation, limiting controllability and robustness. In this paper, we propose ProtoDisent-TTS, a prototype-based disentanglement TTS framework built on a pre-trained text-to-speech backbone that factorizes speaker timbre and dysarthric articulation within a unified latent space. A pathology prototype codebook provides interpretable and controllable representations of healthy and dysarthric speech patterns, while a dual-classifier objective with a gradient reversal layer enforces invariance of speaker embeddings to pathological attributes. Experiments on the TORGO dataset demonstrate that this design enables bidirectional transformation between healthy and dysarthric speech, leading to consistent ASR performance gains and robust, speaker-aware speech reconstruction.
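The dual-classifier objective above hinges on a gradient reversal layer (GRL): the speaker embedding passes unchanged to a pathology classifier in the forward pass, while in the backward pass the gradient flowing to the speaker encoder is negated, pushing the embedding to become uninformative about pathological attributes. A minimal, framework-free sketch of this semantics (the class and parameter names are illustrative, not the paper's implementation):

```python
import numpy as np


class GradientReversal:
    """Gradient reversal layer: identity forward, negated and scaled gradient backward.

    `lambd` controls the strength of the adversarial signal reaching the
    speaker encoder (a hypothetical hyperparameter name).
    """

    def __init__(self, lambd: float = 1.0):
        self.lambd = lambd

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Forward pass: pass the speaker embedding through unchanged,
        # so the downstream pathology classifier sees the real features.
        return x

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        # Backward pass: flip the sign of the classifier's gradient
        # (scaled by lambd) before it reaches the speaker encoder,
        # so the encoder is trained to *remove* pathology information.
        return -self.lambd * grad_output


# Illustrative usage: a speaker embedding and a gradient from the
# pathology classifier's loss.
grl = GradientReversal(lambd=0.5)
embedding = np.array([0.3, -1.2, 0.8])
grad_from_classifier = np.array([0.2, -0.4, 1.0])

out = grl.forward(embedding)                 # identical to `embedding`
grad_to_encoder = grl.backward(grad_from_classifier)  # sign-flipped, scaled
```

In practice this would be implemented as a custom autograd function in the training framework; the sketch only captures the forward/backward contract that makes the speaker embedding adversarially invariant to the pathology labels.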