We propose a novel approach to significantly improve intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task, leveraging self-supervision and sequence-to-sequence (Seq2Seq) learning techniques. Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis to simulate ground-truth speech. Despite training on simulated speech, our method surpasses the current state-of-the-art (SOTA) with a 29.08% improvement on the Mel-Cepstral Distortion (MCD) metric. Additionally, we report error rates and demonstrate our model's ability to synthesize speech in novel target voices. Moreover, we present a methodology for augmenting the existing CSTR NAM TIMIT Plus corpus, setting a benchmark Word Error Rate (WER) of 42.57% to gauge the intelligibility of the synthesized speech. Speech samples can be found at https://nam2speech.github.io/NAM2Speech/
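For context on the headline metric, below is a minimal sketch of how MCD is conventionally computed between time-aligned mel-cepstral frames of reference and synthesized speech. The exclusion of the energy coefficient and the alignment step (e.g., dynamic time warping) are assumptions about the evaluation setup, not details confirmed by the abstract.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """Mean MCD in dB between two time-aligned mel-cepstral sequences.

    ref_mcep, syn_mcep: arrays of shape (T, D), one D-dimensional
    mel-cepstral vector per frame. The energy coefficient c0 is
    assumed to have been dropped beforehand, as is common practice.
    """
    assert ref_mcep.shape == syn_mcep.shape, "sequences must be time-aligned first"
    diff = ref_mcep - syn_mcep
    # Standard formula: (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) per frame,
    # averaged over all frames.
    const = (10.0 / np.log(10.0)) * np.sqrt(2.0)
    per_frame = const * np.sqrt(np.sum(diff ** 2, axis=1))
    return float(per_frame.mean())
```

Lower MCD indicates spectral features closer to the reference, which is why a 29.08% reduction is reported as an improvement.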