Electromyography-to-Speech (ETS) conversion has shown potential for silent speech interfaces by generating audible speech from electromyography (EMG) signals recorded during silent articulation. ETS models usually consist of an EMG encoder, which converts EMG signals to acoustic speech features, and a vocoder, which then synthesises the speech signal. Because the available data are limited and the signals are noisy, the synthesised speech often exhibits low naturalness. In this work, we propose Diff-ETS, an ETS model that uses a score-based diffusion probabilistic model to enhance the naturalness of synthesised speech. The diffusion model is applied to improve the quality of the acoustic features predicted by the EMG encoder. In our experiments, we evaluated two training strategies: fine-tuning the diffusion model on the predictions of a pre-trained EMG encoder, and training both models in an end-to-end fashion. We compared Diff-ETS with a baseline ETS model without diffusion using objective metrics and a listening test. The results indicated that the proposed Diff-ETS significantly improved speech naturalness over the baseline.
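The three-stage pipeline described above (EMG encoder, diffusion-based refinement of the acoustic features, vocoder) can be illustrated with a toy sketch. This is not the Diff-ETS implementation: every function name, the feature size `N_MELS`, the hop length `HOP`, the closed-form score, and the simple Euler update are assumptions made for illustration, standing in for trained networks and for whatever sampler the actual model uses.

```python
import numpy as np

N_MELS = 80   # assumed acoustic feature size; not given in the abstract
HOP = 256     # assumed number of waveform samples per feature frame


def emg_encoder(emg: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the EMG encoder.

    Maps (frames, channels) EMG input to (frames, N_MELS) acoustic features
    with a fixed random projection instead of a trained network.
    """
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((emg.shape[1], N_MELS)) / np.sqrt(emg.shape[1])
    return emg @ weights


def toy_score(x: np.ndarray, cond: np.ndarray, t: float) -> np.ndarray:
    """Toy score: gradient of log N(x; cond, (0.1 + t) I) with respect to x.

    In a real score-based model this would be a trained neural network.
    """
    return (cond - x) / (0.1 + t)


def diffusion_refine(coarse: np.ndarray, n_steps: int = 50, seed: int = 0) -> np.ndarray:
    """Refine encoder predictions by following the score from t = 1 towards t = 0."""
    rng = np.random.default_rng(seed)
    # Start from Gaussian noise centred on the encoder prediction.
    x = coarse + rng.standard_normal(coarse.shape)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        # Deterministic Euler step along the score (no noise is re-injected).
        x = x + toy_score(x, coarse, t) * dt
    return x


def vocoder(features: np.ndarray) -> np.ndarray:
    """Placeholder vocoder: holds each frame's mean feature value for HOP samples."""
    return np.repeat(features.mean(axis=1), HOP)


def synthesise(emg: np.ndarray) -> np.ndarray:
    """Run the full sketch: EMG -> coarse features -> refined features -> waveform."""
    return vocoder(diffusion_refine(emg_encoder(emg)))
```

In this sketch the refinement stage is conditioned on the encoder output, which mirrors the role the abstract assigns to the diffusion model. The two training strategies compared in the experiments (fine-tuning on a pre-trained encoder's predictions versus end-to-end training) would differ in how the encoder and score network are optimised, which the sketch does not model.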