Most current speaker recognition backends, such as cosine scoring, linear discriminant analysis (LDA), or probabilistic linear discriminant analysis (PLDA), make decisions by computing the similarity or distance between enrollment and test embeddings that have already been extracted by neural networks. However, the local structure formed by each embedding and its neighbors in the low-dimensional space differs from one embedding to another; this structure may help recognition but is usually ignored. To exploit it, we propose a graph neural network (GNN) backend that mines latent relationships among embeddings for classification. We treat all embeddings as nodes of a graph and compute the edges between them with a similarity function such as cosine, LDA+cosine, or LDA+PLDA. We study different graph settings and explore GNN variants to find a better message passing and aggregation scheme for the recognition task. Experimental results on the NIST SRE14 i-vector challenge, VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H datasets demonstrate that the proposed GNN backends significantly outperform current mainstream methods.
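To make the graph view concrete, the following is a minimal sketch of the idea, not the proposed architecture. It treats a handful of toy embeddings as nodes, connects each node to its most cosine-similar neighbors, and runs one unweighted mean-aggregation step. The k-nearest-neighbor sparsification, the plain mean aggregator without learned weights, the function names (`build_knn_graph`, `mean_aggregate`), and the toy 2-dimensional vectors are all assumptions made for illustration; the abstract does not specify these details.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_knn_graph(embeddings, k):
    """Connect each node to its k most cosine-similar other nodes.

    Returns a dict mapping node index -> list of (neighbor index, similarity).
    The k-NN rule is an illustrative assumption, not taken from the paper.
    """
    graph = {}
    for i, e_i in enumerate(embeddings):
        sims = [(j, cosine(e_i, e_j))
                for j, e_j in enumerate(embeddings) if j != i]
        sims.sort(key=lambda pair: pair[1], reverse=True)
        graph[i] = sims[:k]
    return graph


def mean_aggregate(embeddings, graph):
    """One message-passing step: replace each node by the unweighted mean
    of itself and its neighbors (no learned parameters)."""
    updated = []
    for i, e_i in enumerate(embeddings):
        group = [e_i] + [embeddings[j] for j, _ in graph[i]]
        updated.append([sum(column) / len(group) for column in zip(*group)])
    return updated


# Hypothetical toy embeddings: two utterances from each of two speakers.
embeddings = [
    [1.0, 0.1],  # speaker A, utterance 1
    [0.9, 0.2],  # speaker A, utterance 2
    [0.1, 1.0],  # speaker B, utterance 1
    [0.2, 0.9],  # speaker B, utterance 2
]

graph = build_knn_graph(embeddings, k=1)
smoothed = mean_aggregate(embeddings, graph)

for node, neighbors in graph.items():
    print(node, neighbors)
```

In this toy setting, aggregation pulls same-speaker embeddings toward each other, which is the kind of neighborhood information a pairwise scorer never sees. A trained GNN would replace the fixed mean with learned message and update functions.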