The performance of neural network-based speech enhancement systems is primarily influenced by the model architecture, whereas training times and computational resource utilization are primarily affected by training parameters such as the batch size. Since noisy and reverberant speech mixtures can have different durations, a batching strategy is required to handle variable-size inputs during training, in particular for state-of-the-art end-to-end systems. Such strategies usually strive for a compromise between zero-padding and data randomization, and can be combined with a dynamic batch size to obtain a more consistent amount of data in each batch. However, the effect of these strategies on resource utilization and, more importantly, on network performance is not well documented. This paper systematically investigates the effect of different batching strategies and batch sizes on the training statistics and speech enhancement performance of a Conv-TasNet, evaluated in both matched and mismatched conditions. We find that using a small batch size during training improves performance in both conditions for all batching strategies. Moreover, using sorted or bucket batching with a dynamic batch size reduces training time and GPU memory usage while achieving performance similar to that of random batching with a fixed batch size.
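To make the trade-off between zero-padding and data randomization concrete, the following is a minimal sketch of random, sorted, and bucket batching, plus a dynamic batch size, written in plain Python. It is an illustration under stated assumptions rather than the implementation evaluated in this paper: the function names, the equal-width bucket boundaries, the cap expressed as padded samples per batch (`max_padded_samples`), and the synthetic utterance lengths are all hypothetical choices.

```python
import random


def chunk(indices, batch_size):
    """Split a list of indices into consecutive batches of at most batch_size."""
    return [indices[i:i + batch_size] for i in range(0, len(indices), batch_size)]


def random_batches(lengths, batch_size, rng):
    """Random batching: shuffle all utterances, then cut fixed-size batches."""
    indices = list(range(len(lengths)))
    rng.shuffle(indices)
    return chunk(indices, batch_size)


def sorted_batches(lengths, batch_size, rng):
    """Sorted batching: sort by length, cut batches, shuffle only the batch order."""
    indices = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = chunk(indices, batch_size)
    rng.shuffle(batches)
    return batches


def bucket_batches(lengths, batch_size, num_buckets, rng):
    """Bucket batching: group by equal-width length buckets, shuffle inside each."""
    lo, hi = min(lengths), max(lengths)
    width = max((hi - lo) / num_buckets, 1e-9)  # guard against identical lengths
    buckets = [[] for _ in range(num_buckets)]
    for i, n in enumerate(lengths):
        buckets[min(int((n - lo) / width), num_buckets - 1)].append(i)
    batches = []
    for bucket in buckets:
        rng.shuffle(bucket)
        batches.extend(chunk(bucket, batch_size))
    rng.shuffle(batches)
    return batches


def dynamic_sorted_batches(lengths, max_padded_samples, rng):
    """Sorted batching with a dynamic batch size capped by padded samples per batch."""
    indices = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for i in indices:
        # Ascending order: lengths[i] would be the longest item in `current`.
        if current and (len(current) + 1) * lengths[i] > max_padded_samples:
            batches.append(current)
            current = []
        current.append(i)
    if current:
        batches.append(current)
    rng.shuffle(batches)
    return batches


def padding_fraction(batches, lengths):
    """Fraction of samples that are zeros when each batch is padded to its longest item."""
    padded = sum(len(b) * max(lengths[i] for i in b) for b in batches)
    real = sum(lengths[i] for b in batches for i in b)
    return 1.0 - real / padded


# Synthetic lengths in samples (assumed 16 kHz): 100 utterances from 1.0 s to 10.9 s.
lengths = [16000 + 1600 * ((i * 37) % 100) for i in range(100)]
rng = random.Random(0)
strategies = {
    "random": random_batches(lengths, 10, rng),
    "sorted": sorted_batches(lengths, 10, rng),
    "bucket": bucket_batches(lengths, 10, 5, rng),
    "sorted + dynamic": dynamic_sorted_batches(lengths, 640000, rng),
}
for name, batches in strategies.items():
    print(f"{name}: {len(batches)} batches, padding {padding_fraction(batches, lengths):.1%}")
```

In this sketch, sorted batching minimizes padding but fixes which utterances share a batch across epochs, random batching maximizes randomization at the cost of padding, and bucket batching sits in between. The dynamic variant lets batches of short utterances hold more items, which is one possible way to keep the amount of data per batch more consistent.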