Effective fusion of multi-scale features is crucial for improving speaker verification performance. However, most existing methods aggregate multi-scale features in a layer-wise manner via simple operations such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion to improve performance. Local feature fusion (LFF) fuses the features within a single residual block to extract the local signal. Global feature fusion (GFF) takes acoustic features of different scales as input to aggregate the global signal. To facilitate effective feature fusion in both LFF and GFF, the ERes2Net architecture employs an attentional feature fusion module in place of summation or concatenation. A range of experiments conducted on the VoxCeleb datasets demonstrates the superiority of ERes2Net in speaker verification.
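To make the contrast with plain summation or concatenation concrete, the following is a minimal NumPy sketch of one possible attention-weighted fusion of two feature maps. It is illustrative only and is not the module described in the paper: the function name `attentional_fusion`, the projection matrices `w1` and `w2`, the ReLU bottleneck, and the sigmoid gate are all assumptions made for this example.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def attentional_fusion(x, y, w1, w2):
    """Fuse two feature maps of shape (channels, frames) with data-dependent weights.

    w1 (hidden, 2 * channels) and w2 (channels, hidden) are hypothetical
    projection matrices standing in for learned parameters.
    """
    stacked = np.concatenate([x, y], axis=0)  # (2 * channels, frames)
    hidden = np.maximum(w1 @ stacked, 0.0)    # bottleneck with ReLU
    gate = sigmoid(w2 @ hidden)               # per-element weights in (0, 1)
    # Convex combination: the gate decides how much of each input to keep.
    return gate * x + (1.0 - gate) * y


# Toy inputs; the sizes are arbitrary and chosen only for illustration.
rng = np.random.default_rng(0)
channels, frames, hidden_dim = 8, 20, 4
x = rng.standard_normal((channels, frames))
y = rng.standard_normal((channels, frames))
w1 = rng.standard_normal((hidden_dim, 2 * channels)) * 0.1
w2 = rng.standard_normal((channels, hidden_dim)) * 0.1

fused = attentional_fusion(x, y, w1, w2)
print(fused.shape)
```

With all-zero projections the gate is 0.5 everywhere, so the sketch reduces to the average of the two inputs (a scaled summation). Non-zero weights let the balance between the two inputs vary per channel and per frame.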