Recent speech enhancement (SE) models have shown impressive performance gains by scaling up model complexity and training data. However, the impact of dataset variability (e.g., text, language, speaker, and noise) remains underexplored. Analyzing each attribute individually is often challenging because multiple attributes are typically entangled in commonly used datasets, which makes it difficult to isolate the distinct contribution of each attribute to model performance. To address this challenge, we propose a generation-training-evaluation framework that leverages zero-shot text-to-speech systems to investigate the impact of controlled attribute variations on SE performance. The framework enables us to synthesize training datasets at scale while varying each attribute in isolation. Based on this framework, we analyze the scaling effects of various dataset attributes on the performance of both discriminative and generative SE models. Extensive experiments on multi-domain corpora suggest that acoustic attributes (e.g., speaker and noise) matter far more to current SE models than semantic attributes (e.g., language and text), offering new insights for future research.
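To make the generation-training-evaluation loop concrete, the sketch below shows the control logic under stated assumptions: every function name (`synthesize_dataset`, `train_se_model`, `evaluate_se_model`), the attribute list, and the sweep values are hypothetical placeholders standing in for a real zero-shot TTS pipeline, SE training run, and evaluation metric; none of them is the paper's actual API. The essential idea is that exactly one dataset attribute is varied at a time while the others are held at baseline values.

```python
"""Minimal sketch of a generation-training-evaluation loop for isolating
dataset attributes. All functions are placeholders, not the paper's code."""
import random
from dataclasses import dataclass, replace
from typing import List

@dataclass(frozen=True)
class AttributeConfig:
    n_texts: int = 100      # distinct transcripts (hypothetical baseline)
    n_languages: int = 1    # languages covered
    n_speakers: int = 10    # zero-shot speaker prompts
    n_noises: int = 5       # noise types mixed in

def synthesize_dataset(cfg: AttributeConfig, size: int) -> List[str]:
    """Placeholder for zero-shot-TTS-based generation: returns `size`
    synthetic utterance ids drawn under the attribute configuration."""
    return [f"utt_{i}" for i in range(size)]

def train_se_model(dataset: List[str]) -> str:
    """Placeholder for training a discriminative or generative SE model."""
    return f"model_on_{len(dataset)}_utts"

def evaluate_se_model(model: str) -> float:
    """Placeholder for evaluation on a fixed multi-domain test set."""
    return random.random()  # stand-in for a real metric such as PESQ

base = AttributeConfig()
# Scale one attribute at a time; all other attributes stay at baseline.
sweeps = {
    "n_texts": [100, 1000],
    "n_languages": [1, 5],
    "n_speakers": [10, 100],
    "n_noises": [5, 50],
}
for attr, values in sweeps.items():
    for v in values:
        cfg = replace(base, **{attr: v})  # controlled variation of one attribute
        model = train_se_model(synthesize_dataset(cfg, size=10_000))
        print(attr, v, evaluate_se_model(model))
```

Keeping all but one attribute fixed is what lets downstream performance differences be attributed to that attribute alone; the actual study would substitute a real TTS generator, SE trainer, and objective metric for the stubs above.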