In this study, we present an approach for training a single speech enhancement network that can perform both personalized and non-personalized speech enhancement. This is achieved by incorporating a frame-wise conditioning input that specifies the type of enhancement output. To improve the quality of the enhanced output and mitigate oversuppression, we experiment with re-weighting frames by the presence or absence of speech activity and with applying augmentations to speaker embeddings. Training in a multi-task learning setting, we show empirically that the proposed unified model obtains promising results on both personalized and non-personalized speech enhancement benchmarks and reaches performance similar to that of models trained specifically for either task. The strong performance of the proposed method demonstrates that the unified model is a more economical alternative to maintaining separate task-specific models at inference time.
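To make the two training ingredients named in the abstract more concrete, the following is a minimal NumPy sketch of (i) a frame-wise conditioning input and (ii) a loss that re-weights frames by speech activity. It is an illustration under stated assumptions, not the authors' implementation: the function names `build_conditioned_input` and `vad_weighted_mse`, the use of a binary per-frame flag, the zeroing of the speaker embedding on non-personalized frames, the mean-squared-error objective, and the default weights are all hypothetical choices made for this example.

```python
import numpy as np


def build_conditioned_input(features, speaker_embedding, personalized):
    """Append a per-frame task flag and a (possibly zeroed) speaker embedding.

    features:          (num_frames, num_bins) noisy spectral features.
    speaker_embedding: (embed_dim,) vector for the target speaker.
    personalized:      (num_frames,) 0/1 flags; 1 requests personalized output.
    """
    flags = np.asarray(personalized, dtype=features.dtype)[:, None]  # (T, 1)
    # Assumption: the embedding is zeroed on frames that request
    # non-personalized enhancement, so the flag alone selects the task.
    tiled = flags * speaker_embedding[None, :]  # (T, D)
    return np.concatenate([features, flags, tiled], axis=1)  # (T, F + 1 + D)


def vad_weighted_mse(estimate, target, speech_active, w_speech=1.0, w_silence=2.0):
    """Mean squared error with per-frame weights chosen by speech activity.

    The default weights are arbitrary placeholders, not values from the paper.
    """
    active = np.asarray(speech_active, dtype=bool)
    weights = np.where(active, w_speech, w_silence)  # (T,)
    per_frame = np.mean((estimate - target) ** 2, axis=1)  # (T,)
    return float(np.sum(weights * per_frame) / np.sum(weights))
```

In this sketch, switching a frame's flag from 1 to 0 changes only the conditioning columns, so the same network weights can serve both tasks. The relative size of `w_speech` and `w_silence` controls how strongly errors on speech frames are penalized compared with errors on non-speech frames.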