In speaker verification, ECAPA-TDNN has achieved notable improvements by combining a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet architectures have been modernized by borrowing design choices from the Transformer, resulting in improved performance. In this paper, we present an improved block design for TDNN in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block consists of two separate sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows inter-frame and intra-frame contexts to be captured flexibly. Additionally, we introduce global response normalization (GRN) into the FFN modules to enable more selective feature propagation, similar to the SE module in ECAPA-TDNN. Experimental results demonstrate that the resulting model, NeXt-TDNN, with its modernized backbone block, significantly improves speaker verification performance while reducing parameter size and inference time. We have released our code for future studies.
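
To make the GRN step concrete, the following is a minimal sketch of global response normalization applied to a 1D feature map, assuming the formulation commonly used in ConvNeXt V2 (per-channel L2 norm, division by the mean norm across channels, then a learned scale and bias with a residual connection). The function name `grn_1d`, the list-of-lists layout (channels by frames), and the `eps` value are illustrative assumptions, not taken from the released code.

```python
import math


def grn_1d(x, gamma, beta, eps=1e-6):
    """Sketch of GRN over a 1D feature map (hypothetical helper).

    x:     list of C channels, each a list of T frame values
    gamma: per-channel scale (length C)
    beta:  per-channel bias (length C)
    """
    # Step 1: global aggregation, one L2 norm over time per channel.
    g = [math.sqrt(sum(v * v for v in channel)) for channel in x]
    # Step 2: divisive normalization by the mean norm across channels.
    mean_g = sum(g) / len(g)
    n = [g_c / (mean_g + eps) for g_c in g]
    # Step 3: calibrate each channel and add the residual input.
    return [
        [gamma[c] * v * n[c] + beta[c] + v for v in channel]
        for c, channel in enumerate(x)
    ]


# Two channels, two frames: the stronger channel is amplified
# relative to the silent one.
features = [[3.0, 4.0], [0.0, 0.0]]
scaled = grn_1d(features, gamma=[1.0, 1.0], beta=[0.0, 0.0])
```

With `gamma` and `beta` set to zero the function returns its input unchanged, so under this formulation the layer can start as an identity mapping and learn how strongly to reweight channels.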