Several highly effective neural approaches to single-channel speech separation have recently been presented in the literature. However, because of the size and complexity of these models, deploying them on low-resource devices, e.g., hearing aids and earphones, remains a challenge, and established solutions are not yet available. Although approaches based on pruning or compressing neural models have been proposed, designing a model architecture suited to a given application domain often requires heuristic procedures that are not easily portable across low-resource platforms. Given the modular nature of the well-known Conv-TasNet speech separation architecture, in this paper we consider three parameters that directly control the overall size of the model, namely the number of residual blocks, the number of repetitions of the separation blocks, and the number of channels in the depth-wise convolutions, and we experimentally evaluate how they affect speech separation performance. In particular, experiments carried out on the Libri2Mix dataset show that the number of dilated 1D-Conv blocks is the most critical parameter and that using extra dilation in the residual blocks reduces the performance drop.
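To make concrete how the three parameters scale the model, the following is a minimal sketch of a parameter count for the separation module. It is not taken from this paper: it assumes the block layout of the original Conv-TasNet description (a 1x1 convolution from B to H channels, a depth-wise convolution with kernel size P over H channels, two PReLU activations with global layer normalization, and 1x1 residual and skip outputs), and the default values for `B`, `H`, `Sc`, and `P` are illustrative assumptions. The function names are hypothetical. The receptive-field helper assumes the standard dilation schedule 2^0, ..., 2^(X-1) in each repetition, not the extra-dilation variant studied here.

```python
def block_params(B, H, Sc, P):
    """Trainable parameters in one 1-D conv block (assumed layout, see lead-in)."""
    pointwise_in = B * H + H          # 1x1 conv, B -> H channels, with bias
    depthwise = H * P + H             # depth-wise conv, one kernel per channel
    norms_and_prelu = 2 * (2 * H) + 2 # two gLN layers (gain, bias) + two PReLUs
    residual_out = H * B + B          # 1x1 conv, H -> B channels
    skip_out = H * Sc + Sc            # 1x1 conv, H -> Sc channels
    return pointwise_in + depthwise + norms_and_prelu + residual_out + skip_out


def separator_params(X, R, B=128, H=512, Sc=128, P=3):
    """Parameters in R repetitions of X stacked blocks (defaults are assumptions)."""
    return X * R * block_params(B, H, Sc, P)


def receptive_field(X, R, P=3):
    """Receptive field in frames, assuming dilation 2**x for x = 0..X-1 per repetition."""
    return 1 + R * (P - 1) * sum(2 ** x for x in range(X))


if __name__ == "__main__":
    for X, R in [(8, 3), (4, 3), (8, 1)]:
        print(X, R, separator_params(X, R), receptive_field(X, R))
```

Under these assumptions, the block count X and the repetition count R scale the separator size linearly, while X also sets the receptive field exponentially through the dilation schedule. This is one plausible reading of why reducing the number of dilated blocks is costly and why additional dilation might partly compensate.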