Generative speech enhancement (GSE) models show great promise in producing high-quality clean speech from noisy inputs, enabling applications such as curating noisy text-to-speech (TTS) datasets into high-quality ones. However, GSE models are prone to hallucination errors, such as phoneme omissions and speaker inconsistency, which conventional error filtering based on non-intrusive speech quality metrics often fails to detect. To address this issue, we propose a non-intrusive method for filtering hallucination errors produced by discrete token-based GSE models. Our method leverages the log-probabilities of generated tokens as confidence scores to detect potential errors. Experimental results show that the confidence scores strongly correlate with a suite of intrusive speech enhancement metrics, and that our method effectively identifies hallucination errors missed by conventional filtering methods. Furthermore, we demonstrate the practical utility of our method: curating an in-the-wild TTS dataset with our confidence-based filtering improves the performance of subsequently trained TTS models.
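The core filtering idea described above can be sketched in a few lines: aggregate the log-probabilities of the tokens emitted by the GSE model into a per-utterance confidence score, then discard utterances below a threshold. This is a minimal illustration, not the paper's exact procedure; the aggregation (mean log-probability), the threshold value, and the helper names below are all assumptions made for the sketch.

```python
def confidence_score(token_logprobs):
    """Mean log-probability of the generated tokens; higher means the
    model was more confident. (Hypothetical aggregation: the exact
    scoring function is not specified in this sketch's source.)"""
    return sum(token_logprobs) / len(token_logprobs)


def filter_utterances(utterances, threshold=-1.0):
    """Keep only enhanced utterances whose confidence clears the threshold.
    `utterances` is a list of (utterance_id, list_of_token_logprobs);
    the threshold value here is illustrative, not from the paper."""
    return [uid for uid, lps in utterances if confidence_score(lps) >= threshold]


# Toy example: the second utterance contains low-probability tokens,
# which under this scheme would flag a possible hallucination.
data = [
    ("utt_a", [-0.10, -0.20, -0.05]),   # mean ≈ -0.12 → kept
    ("utt_b", [-2.50, -3.00, -1.80]),   # mean ≈ -2.43 → filtered out
]
print(filter_utterances(data))  # → ['utt_a']
```

In practice the log-probabilities would come from the GSE model's decoder output at generation time, which is what makes the score non-intrusive: no clean reference signal is required.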