Traditional Time Delay Neural Networks (TDNNs) achieve state-of-the-art performance, but at the cost of high computational complexity and slow inference, which makes them difficult to deploy in industrial settings. The Densely Connected Time Delay Neural Network (D-TDNN) with a Context Aware Masking (CAM) module has proven to be an efficient structure that reduces complexity while maintaining system performance. In this paper, we propose a fast and lightweight model, LightCAM, which further adopts a depthwise separable convolution module (DSM) and uses multi-scale feature aggregation (MFA) to fuse features at different levels. Extensive experiments on the VoxCeleb dataset show that LightCAM achieves an EER of 0.83 and a MinDCF of 0.0891 on VoxCeleb1-O, outperforming other mainstream speaker verification methods. In addition, complexity analysis demonstrates that the proposed architecture has lower computational cost and faster inference speed.
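To illustrate why a depthwise separable convolution lowers complexity, the following minimal sketch compares the parameter and multiply-accumulate (MAC) counts of a standard 1D convolution with those of a generic depthwise separable one. It is not the DSM layout used in LightCAM, which the abstract does not specify. The function names, the channel width of 512, the kernel size of 3, and the 200 frames are all hypothetical values chosen for illustration; biases are ignored, and stride 1 with length-preserving padding is assumed.

```python
def conv1d_cost(c_in, c_out, kernel, frames):
    """Parameters and MACs of a standard 1D convolution (no bias)."""
    params = c_in * c_out * kernel
    # Each output frame applies every weight once (stride 1, same padding assumed).
    macs = params * frames
    return params, macs


def depthwise_separable_conv1d_cost(c_in, c_out, kernel, frames):
    """Parameters and MACs of a depthwise conv followed by a pointwise (1x1) conv."""
    depthwise_params = c_in * kernel   # one length-`kernel` filter per input channel
    pointwise_params = c_in * c_out    # 1x1 convolution that mixes channels
    params = depthwise_params + pointwise_params
    macs = params * frames
    return params, macs


if __name__ == "__main__":
    # Hypothetical layer sizes, not taken from the paper.
    c_in, c_out, kernel, frames = 512, 512, 3, 200
    std_params, std_macs = conv1d_cost(c_in, c_out, kernel, frames)
    dsc_params, dsc_macs = depthwise_separable_conv1d_cost(c_in, c_out, kernel, frames)
    print(f"standard:  {std_params} params, {std_macs} MACs")
    print(f"separable: {dsc_params} params, {dsc_macs} MACs")
    print(f"ratio:     {dsc_params / std_params:.4f}")
```

Under these assumptions the cost ratio equals 1/c_out + 1/kernel, so the saving grows with kernel size and is bounded by the pointwise term for wide layers.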