Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by this neuromotor disorder and enhances their social inclusion. Systems based on a neural encoder-decoder (NED) have significantly improved the intelligibility of the reconstructed speech compared with approaches based on generative adversarial networks (GANs), but they remain limited by training inefficiency caused by the cascaded pipeline and the auxiliary tasks of the content encoder, which may in turn affect reconstruction quality. Inspired by self-supervised speech representation learning and discrete speech units, we propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT to improve training efficiency and uses speech units to constrain dysarthric content restoration in a discrete linguistic space. Compared with NED approaches, the Unit-DSR system consists only of a speech unit normalizer and a Unit HiFi-GAN vocoder, making it considerably simpler, with no cascaded sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that Unit-DSR outperforms competitive baselines in content restoration, achieving a 28.2% average relative word error rate reduction compared with the original dysarthric speech, and is robust to speed perturbation and noise.
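The relative word error rate (WER) reduction quoted in the abstract can be read as the fraction of the baseline WER that a system removes. The following is a minimal sketch of that arithmetic, assuming WER is computed as word-level edit distance divided by reference length; the function names `word_error_rate` and `relative_wer_reduction` and the example sentences are hypothetical illustrations, not the authors' evaluation code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline WER removed by the system."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical transcripts, used only to exercise the functions.
reference = "the cat sat on the mat"
baseline = word_error_rate(reference, "the bat sat on a mat")  # 2 errors in 6 words
system = word_error_rate(reference, "the cat sat on a mat")    # 1 error in 6 words
print(f"relative WER reduction: {relative_wer_reduction(baseline, system):.1%}")
```

In this reading, the baseline is the recognizer's WER on the original dysarthric speech and the system WER is measured on the reconstructed speech. How the figure is averaged across speakers or intelligibility groups is not specified in the abstract, so the sketch leaves averaging out.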