The emergence of self-supervised representations (e.g., wav2vec 2.0) allows speaker-recognition approaches to process speech signals through foundation models built on speech data. Nevertheless, how to fuse these representations effectively requires further investigation, because existing approaches rely on fixed or sub-optimal temporal pooling strategies. Although improved strategies based on graph learning and graph attention have been proposed, these approaches still use non-injective aggregation, which may degrade speaker recognition performance. To address this, we propose a speaker recognition approach that applies an Isomorphic Graph ATtention network (IsoGAT) to self-supervised representations. The proposed approach comprises three modules (representation learning, graph attention, and aggregation) and jointly considers learning on the self-supervised representation and the IsoGAT. We evaluate it on speaker recognition tasks using the VoxCeleb1 and VoxCeleb2 datasets; the experimental results show improved recognition performance compared with existing pooling approaches on the self-supervised representation.
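To make the notion of non-injective aggregation concrete, the following is a minimal sketch in plain Python, not the IsoGAT implementation. The function names and toy frame values are hypothetical and chosen only for illustration. It shows two different multisets of frame-level features that mean and max pooling map to the same vector, while sum aggregation keeps them apart.

```python
# Illustrative only: toy pooling functions over frame-level features.
# Each "utterance" is a list of frames; each frame is a list of floats.

def mean_pool(frames):
    """Average each feature dimension over time."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def max_pool(frames):
    """Take the maximum of each feature dimension over time."""
    return [max(col) for col in zip(*frames)]

def sum_pool(frames):
    """Sum each feature dimension over time."""
    return [sum(col) for col in zip(*frames)]

# Hypothetical inputs: b repeats every frame of a, so the two multisets differ.
a = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

mean_collides = mean_pool(a) == mean_pool(b)  # both are [0.5, 0.5]
max_collides = max_pool(a) == max_pool(b)     # both are [1.0, 1.0]
sum_collides = sum_pool(a) == sum_pool(b)     # [1.0, 1.0] vs [2.0, 2.0]

print(mean_collides, max_collides, sum_collides)
```

Summing raw features separates this particular pair but is not injective in general; graph isomorphism style aggregators typically apply a suitable learned transform to each element before summing. How IsoGAT achieves this is described in the paper itself and is not reproduced here.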