Existing target sound extraction (TSE) methods have relied primarily on discriminative approaches to separate the target sound while minimizing interference from unwanted sources, with varying success in separating the target from the background. This study introduces DPM-TSE, the first generative method based on diffusion probabilistic modeling (DPM) for target sound extraction, which aims to achieve both cleaner target renderings and improved separability from unwanted sounds. The method also addresses the background noise issues common to DPMs by introducing a correction method for the noise schedule and sampling steps. The approach is evaluated with both objective and subjective quality metrics on the FSD Kaggle 2018 dataset. The results show that DPM-TSE significantly improves perceived quality in terms of both target extraction and purity.