Graph feature fusion technique for speaker recognition based on the wav2vec2.0 framework

The pre-trained wav2vec2.0 model has proven effective for speaker recognition. However, current feature processing methods focus on applying classical pooling, such as mean pooling and max pooling, to the output features of the pre-trained wav2vec2.0 model. These methods treat the features as independent and unrelated units, ignore the inter-relationships among them, and do not treat the features as an overall representation of a speaker. The Gated Recurrent Unit (GRU), a feature fusion method that can also be regarded as a more complex pooling technique, focuses mainly on temporal information and may perform poorly when the main information does not lie along the temporal dimension. In this paper, we investigate the graph neural network (GNN) as a backend processing module within the wav2vec2.0 framework to address these issues. The GNN takes all the output features as graph signal data and extracts the graph structure information relating the features for speaker recognition. Specifically, we first give a simple theoretical proof that the GNN feature fusion method can outperform mean, max, random, and similar pooling methods. Then, we model the output features of wav2vec2.0 as the vertices of a graph and construct the graph adjacency matrix with a graph attention network (GAT). Finally, we follow the message passing neural network (MPNN) framework to design our message function, vertex update function, and readout function, which transform the speaker features into graph features. Experiments show that our method achieves a relative improvement over the baseline methods. Code is available at xxx.
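The pipeline described above (frame features as vertices, an attention-derived adjacency matrix, then message, update, and readout functions) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `gat_adjacency` and `mpnn_fuse`, the dense fully connected graph, the tanh update, and the mean readout are all assumptions made for illustration, and random matrices stand in for learned weights and for real wav2vec2.0 outputs.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gat_adjacency(H, W, a, slope=0.2):
    """Dense GAT-style attention over T frame features (hypothetical helper).

    H: (T, d) frame features, W: (d, k) projection, a: (2k,) attention vector.
    Returns A with A[i, j] = softmax_j(LeakyReLU(a^T [W h_i || W h_j])).
    """
    Z = H @ W                                # projected vertices, (T, k)
    k = Z.shape[1]
    # a^T [z_i || z_j] splits into a source term plus a destination term.
    e = (Z @ a[:k])[:, None] + (Z @ a[k:])[None, :]
    e = np.where(e > 0, e, slope * e)        # LeakyReLU
    return softmax(e, axis=1)                # each row sums to 1

def mpnn_fuse(H, A, W_msg, W_upd, steps=1, act=np.tanh):
    """Message passing followed by a mean readout (assumed design choices)."""
    for _ in range(steps):
        M = A @ (H @ W_msg)                  # message + weighted aggregation
        H = act(H @ W_upd + M)               # vertex update
    return H.mean(axis=0)                    # readout: one vector per utterance

# Toy usage: 5 frames of 4-dimensional features (stand-ins for wav2vec2.0 outputs).
rng = np.random.default_rng(0)
H = rng.standard_normal((5, 4))
A = gat_adjacency(H, rng.standard_normal((4, 3)), rng.standard_normal(6))
embedding = mpnn_fuse(H, A, rng.standard_normal((4, 4)), rng.standard_normal((4, 4)))
```

In this sketch, mean pooling is recovered as a special case: a zero attention vector makes the adjacency uniform, and an identity message matrix, a zero update matrix, and an identity activation reduce the readout to the plain mean of the frames. This gives one intuition for why a learned graph fusion can do at least as well as mean pooling, although it is not the proof referred to in the abstract.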
