In speaker verification, ECAPA-TDNN has shown remarkable improvement by combining a one-dimensional (1D) Res2Net block and a squeeze-and-excitation (SE) module with multi-layer feature aggregation (MFA). Meanwhile, in vision tasks, ConvNet structures have been modernized by borrowing design choices from the Transformer, resulting in improved performance. In this paper, we present an improved block design for TDNN in speaker verification. Inspired by recent ConvNet structures, we replace the SE-Res2Net block in ECAPA-TDNN with a novel 1D two-step multi-scale ConvNeXt block, which we call TS-ConvNeXt. The TS-ConvNeXt block consists of two separate sub-modules: a temporal multi-scale convolution (MSC) and a frame-wise feed-forward network (FFN). This two-step design allows inter-frame and intra-frame contexts to be captured flexibly. Additionally, we introduce global response normalization (GRN) into the FFN modules to enable more selective feature propagation, similar to the role of the SE module in ECAPA-TDNN. Experimental results demonstrate that the resulting model, NeXt-TDNN, with its modernized backbone block, significantly improves performance in speaker verification tasks while reducing parameter count and inference time. We have released our code for future studies.
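The two-step structure described in the abstract can be sketched roughly as follows. This is a minimal PyTorch illustration, not the released implementation: the class names `GRN1d` and `TSConvNeXtSketch`, the channel width, the kernel sizes, the expansion ratio, and the placement of layer normalization and residual connections are all assumptions made for the example. The GRN layer follows the general formulation from ConvNeXt V2, adapted here to aggregate over the time axis.

```python
import torch
import torch.nn as nn


class GRN1d(nn.Module):
    """Global response normalization for (batch, time, channels) tensors.

    Hypothetical 1D adaptation of the ConvNeXt V2 GRN layer.
    """

    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        # Zero initialization makes the layer start as an identity mapping.
        self.gamma = nn.Parameter(torch.zeros(1, 1, channels))
        self.beta = nn.Parameter(torch.zeros(1, 1, channels))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-channel L2 norm aggregated over the time axis.
        gx = torch.norm(x, p=2, dim=1, keepdim=True)
        # Divisive normalization of each channel by the mean over channels.
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * (x * nx) + self.beta + x


class TSConvNeXtSketch(nn.Module):
    """Two-step block: temporal multi-scale conv (MSC), then frame-wise FFN."""

    def __init__(self, channels: int = 192, kernel_sizes=(7, 65), expansion: int = 4):
        super().__init__()
        # Assumption: channels are split evenly, one group per kernel size.
        assert channels % len(kernel_sizes) == 0
        self.split = channels // len(kernel_sizes)
        self.norm1 = nn.LayerNorm(channels)
        # Depthwise convolutions; odd kernels with k // 2 padding keep the length.
        self.dwconvs = nn.ModuleList([
            nn.Conv1d(self.split, self.split, k, padding=k // 2, groups=self.split)
            for k in kernel_sizes
        ])
        self.norm2 = nn.LayerNorm(channels)
        # Linear layers act on the last (channel) axis, i.e. on each frame.
        self.ffn = nn.Sequential(
            nn.Linear(channels, expansion * channels),
            nn.GELU(),
            GRN1d(expansion * channels),
            nn.Linear(expansion * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time).
        # Step 1 (inter-frame): multi-scale depthwise convolution over time.
        y = self.norm1(x.transpose(1, 2)).transpose(1, 2)
        chunks = torch.split(y, self.split, dim=1)
        y = torch.cat([conv(c) for conv, c in zip(self.dwconvs, chunks)], dim=1)
        x = x + y
        # Step 2 (intra-frame): position-wise FFN with GRN.
        y = self.ffn(self.norm2(x.transpose(1, 2))).transpose(1, 2)
        return x + y


if __name__ == "__main__":
    block = TSConvNeXtSketch(channels=8, kernel_sizes=(3, 7), expansion=2)
    features = torch.randn(2, 8, 50)  # (batch, channels, frames)
    print(block(features).shape)
```

In this sketch the block preserves the input shape, so blocks of this kind could be stacked in a backbone. Separating the temporal convolution from the frame-wise FFN means the time-axis receptive field (kernel sizes) and the per-frame capacity (expansion ratio) can be adjusted independently.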