As more speech technologies rely on a supervised deep learning approach with clean speech as the ground truth, a methodology to onboard such speech at scale is needed. This methodology should minimize the dependency on human listening and annotation, requiring a human in the loop only when needed. In this paper, we address this issue by outlining the Speech Enhancement-based Curation Pipeline (SECP), which serves as a framework to onboard clean speech. This clean speech can then be used to train a speech enhancement model, which can further refine the original dataset and thus close the iterative loop. By running two iterative rounds, we observe that using enhanced output as ground truth does not degrade model performance according to $\Delta_{PESQ}$, a metric used in this paper. We also show through comparative mean opinion score (CMOS) based subjective tests that both the highest and lowest bounds of the refined data are perceptually better than the original data.
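The iterative loop described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the names `run_curation_loop`, `is_clean`, and `train_enhancer` are hypothetical, a clip is represented as a plain list of samples, and the rule used to accept a clip as clean is left to the caller because the abstract does not specify it.

```python
from typing import Callable, List, Sequence

Clip = List[float]  # hypothetical stand-in for one waveform
Enhancer = Callable[[Clip], Clip]


def run_curation_loop(
    raw_clips: Sequence[Clip],
    is_clean: Callable[[Clip], bool],
    train_enhancer: Callable[[Sequence[Clip]], Enhancer],
    rounds: int = 2,
) -> List[Clip]:
    """Curate clean clips, train an enhancer on them, refine the data, repeat."""
    data = [list(clip) for clip in raw_clips]
    for _ in range(rounds):
        # Assumption: a caller-supplied check decides which clips count as
        # clean ground truth for this round.
        clean = [clip for clip in data if is_clean(clip)]
        if not clean:
            # Nothing to train on, so stop early and return the data as-is.
            break
        enhance = train_enhancer(clean)
        # Refine the whole dataset with the newly trained model; the refined
        # data feeds the next round, which closes the loop.
        data = [enhance(clip) for clip in data]
    return data
```

With toy stand-ins (a threshold as the cleanliness check and a fixed gain as the "trained" model), two rounds could be run as follows:

```python
toy_clips = [[0.25, 0.125], [1.0, -1.0]]
refined = run_curation_loop(
    toy_clips,
    is_clean=lambda clip: max(abs(s) for s in clip) < 0.5,
    train_enhancer=lambda clean: (lambda clip: [0.5 * s for s in clip]),
    rounds=2,
)
```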