We propose a novel framework for enhancing the intelligibility of electrolaryngeal (EL) speech through the use of robust linguistic encoders. Pretraining and fine-tuning approaches have proven to work well in this task, but in most cases mismatches between the datasets used in each stage, such as a speech type mismatch (EL vs. typical) or a speaker mismatch, can degrade the conversion performance of this framework. To resolve this issue, we propose a linguistic encoder robust enough to project both EL and typical speech into the same latent space while still extracting accurate linguistic information, creating a unified representation that reduces the speech type mismatch. Furthermore, we introduce HuBERT output features to the proposed framework to reduce the speaker mismatch, making it possible to effectively use a large-scale parallel dataset during pretraining. Compared to the conventional framework using mel-spectrogram input and output features, the proposed framework enables the model to synthesize more intelligible and natural-sounding speech, as shown by a significant 16% improvement in character error rate and a 0.83 improvement in naturalness score.
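For readers unfamiliar with the intelligibility metric, the following is a minimal sketch of how a character error rate is commonly computed: the character-level Levenshtein distance between a reference transcript and a hypothesis transcript, divided by the reference length. The abstract does not describe the evaluation pipeline, so this assumes the standard definition; the function names `edit_distance` and `character_error_rate` are hypothetical and are not taken from the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of r
                curr[j - 1] + 1,     # insertion of h
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """Edit distance normalized by reference length (assumed standard CER definition)."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

In a typical evaluation, `hyp` would be a transcript of the converted speech and `ref` the ground-truth text; a lower value indicates more intelligible output.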