Effective fusion of multi-scale features is crucial for improving speaker verification performance. However, most existing methods aggregate multi-scale features in a layer-wise manner via simple operations such as summation or concatenation. This paper proposes a novel architecture called Enhanced Res2Net (ERes2Net), which incorporates both local and global feature fusion to improve speaker verification performance. Local feature fusion (LFF) fuses the features within a single residual block to extract the local signal. Global feature fusion (GFF) takes acoustic features of different scales as input to aggregate the global signal. To facilitate effective feature fusion in both LFF and GFF, ERes2Net employs an attentional feature fusion module in place of summation or concatenation. Experiments conducted on the VoxCeleb datasets demonstrate the superiority of ERes2Net in speaker verification. Code is publicly available at https://github.com/alibaba-damo-academy/3D-Speaker.
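To make the contrast between simple summation and attention-based fusion concrete, the following is a minimal sketch of a gated fusion of two feature vectors. It is an illustration only and not the ERes2Net module itself: the function names (`attentional_fusion`, `sum_fusion`, `make_weights`), the single linear scoring layer, and the sigmoid gate are all assumptions chosen for brevity, and the abstract does not specify the internal design of the attentional feature fusion module.

```python
import math
import random


def sigmoid(value):
    """Logistic function mapping a real score to a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def make_weights(channels, seed=0):
    """Hypothetical helper: random scoring weights, one row per channel.

    Each row has 2 * channels entries because the score for a channel is
    computed from the concatenation of both input feature vectors.
    """
    rng = random.Random(seed)
    return [[rng.uniform(-0.5, 0.5) for _ in range(2 * channels)]
            for _ in range(channels)]


def sum_fusion(x, y):
    """Baseline: element-wise summation, with no learned weighting."""
    return [a + b for a, b in zip(x, y)]


def attentional_fusion(x, y, weights):
    """Fuse x and y with per-channel gates computed from both inputs.

    Assumed form (not taken from the paper):
        gate_i  = sigmoid(weights[i] . concat(x, y))
        fused_i = gate_i * x_i + (1 - gate_i) * y_i
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of channels")
    if len(weights) != len(x):
        raise ValueError("weights must have one row per channel")
    concat = list(x) + list(y)
    fused = []
    for i, row in enumerate(weights):
        score = sum(w * v for w, v in zip(row, concat))
        gate = sigmoid(score)
        fused.append(gate * x[i] + (1.0 - gate) * y[i])
    return fused


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]   # e.g. features from one scale
    y = [3.0, 0.0, 3.0]   # e.g. features from another scale
    print("sum      :", sum_fusion(x, y))
    print("attention:", attentional_fusion(x, y, make_weights(len(x))))
```

In this sketch each fused channel is a convex combination of the two inputs, so the gate decides per channel how much each scale contributes, whereas summation weights both inputs equally regardless of content. In a trained network the scoring weights would be learned rather than drawn at random.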