Electro-laryngeal (EL) speech is characterized by constant pitch, limited prosody, and mechanical noise, all of which reduce naturalness and intelligibility. We propose a lightweight adaptation of the state-of-the-art StreamVC framework to this setting: we remove the pitch and energy modules and combine self-supervised pretraining with supervised fine-tuning on parallel EL and healthy (HE) speech data, guided by perceptual and intelligibility losses. Objective and subjective evaluations across loss configurations confirm that the choice of losses affects conversion quality: the best model variant, which uses WavLM features and human-feedback predictions (+WavLM+HF), drastically reduces the character error rate (CER) relative to the unprocessed EL inputs, raises the naturalness mean opinion score (nMOS) from 1.1 to 3.3, and consistently narrows the gap to HE ground-truth speech on all evaluated metrics. These findings demonstrate the feasibility of adapting lightweight voice conversion architectures to EL voice rehabilitation, while identifying prosody generation and intelligibility improvements as the main remaining bottlenecks.
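The abstract reports CER without stating how it is computed. A minimal sketch follows, assuming the common definition: character-level Levenshtein distance between a reference transcript and a recognizer's hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative, and the recognizer and text normalization used for the reported results are not specified here.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, CER can exceed 1.0 when the hypothesis is much longer than the reference, so it is an error rate rather than a bounded percentage.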