We propose a novel framework for enhancing the intelligibility of electrolaryngeal (EL) speech using robust linguistic encoders. Pretraining followed by fine-tuning has proven effective for this task, but in most cases mismatches between the datasets used in each stage, such as a speech type mismatch (EL vs. typical speech) or a speaker mismatch, can degrade conversion performance. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech into the same latent space while still extracting accurate linguistic information, creating a unified representation that reduces the speech type mismatch. Furthermore, we introduce HuBERT output features into the proposed framework to reduce the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. Compared to the conventional framework, which uses mel-spectrograms as both input and output features, the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and a 0.83 improvement in naturalness score.
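For readers unfamiliar with the intelligibility metric, the following is a minimal sketch of how a character error rate (CER) is commonly computed. It assumes the standard definition: the character-level Levenshtein distance between a reference transcript and a recognized transcript, divided by the reference length. The abstract does not state which speech recognizer or text normalization was used, nor whether the 16% figure is absolute or relative, so the function names and example strings here are purely illustrative.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix processed so far
    # and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]  # distance from i reference characters to an empty hypothesis
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            curr.append(
                min(
                    prev[j] + 1,                       # delete a reference character
                    curr[j - 1] + 1,                   # insert a hypothesis character
                    prev[j - 1] + substitution_cost,   # match or substitute
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters (assumed definition)."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from the paper's data.
    print(character_error_rate("hello", "hallo"))
```

In practice the hypothesis would be the output of a speech recognizer applied to the converted speech, so a lower CER indicates that the synthesized speech is easier to recognize.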