Speech separation remains an important area of multi-speaker signal processing. Deep neural network (DNN) models have attained the best performance on many speech separation benchmarks, but some of these models take significant time to train and have high memory requirements. Previous work has proposed shortening training examples to address these issues, but the impact of doing so on model performance is not yet well understood. In this work, the impact of applying training signal length (TSL) limits is analysed for two speech separation models: SepFormer, a transformer model, and Conv-TasNet, a convolutional model. The WSJ0-2Mix, WHAMR and Libri2Mix datasets are analysed in terms of signal length distribution and its impact on training efficiency. It is demonstrated that, for specific signal length distributions, applying specific TSL limits results in better performance. This is shown to be mainly because randomly sampling the start index of each waveform yields more unique examples for training. A SepFormer model trained using a TSL limit of 4.42s and dynamic mixing (DM) is shown to match the best-performing SepFormer model trained with DM and unlimited signal lengths. Furthermore, the 4.42s TSL limit results in a 44% reduction in training time with WHAMR.
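As an illustration of the random start index sampling described above, the following is a minimal sketch of how a TSL limit might be applied to one training example. It is not the authors' implementation: the function name `apply_tsl_limit`, its parameters, and the 8 kHz sample rate are assumptions made for the example; only the 4.42s limit comes from the text.

```python
import numpy as np


def apply_tsl_limit(mixture, sources, tsl_seconds=4.42, sample_rate=8000, rng=None):
    """Crop a mixture and its sources to at most `tsl_seconds`.

    Hypothetical helper: the name, signature and the 8 kHz default
    sample rate are assumptions, not taken from the paper.
    The time axis is assumed to be the last axis of both arrays.
    """
    rng = np.random.default_rng() if rng is None else rng
    max_len = int(round(tsl_seconds * sample_rate))
    num_samples = mixture.shape[-1]

    # Signals already within the limit are returned unchanged.
    if num_samples <= max_len:
        return mixture, sources

    # Draw the start index uniformly so that repeated passes over the same
    # utterance can see different segments (the upper bound is exclusive).
    start = int(rng.integers(0, num_samples - max_len + 1))
    end = start + max_len

    # The same window is applied to the mixture and to every source so
    # that the training targets stay aligned with the input.
    return mixture[..., start:end], sources[..., start:end]
```

Calling the function once per example in each epoch, rather than cropping the dataset once in advance, is what would let the same utterance contribute different segments over the course of training.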