The emergence of self-supervised representations (e.g., wav2vec 2.0) allows speaker recognition approaches to process speech signals through foundation models pre-trained on speech data. Nevertheless, how to fuse these representations effectively requires further investigation, because existing approaches rely on fixed or sub-optimal temporal pooling strategies. Although improved strategies based on graph learning and graph attention have been proposed, these approaches still use non-injective aggregation, which may degrade speaker recognition performance. To address this, we propose a speaker recognition approach that applies an Isomorphic Graph ATtention network (IsoGAT) to self-supervised representations. The proposed approach contains three modules (representation learning, graph attention, and aggregation) and jointly learns the self-supervised representation and the IsoGAT. We perform speaker recognition experiments on the VoxCeleb1 and VoxCeleb2 datasets, with the results demonstrating the recognition performance of the proposed approach relative to existing pooling approaches on the self-supervised representation.
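As a minimal illustration of what non-injective aggregation means, the sketch below pools two different sets of frame-level features and checks whether the pooled vectors coincide. It is not the IsoGAT aggregation from this work; the toy features, the scoring vector `w`, and all function names are assumptions chosen only for illustration. Mean pooling and softmax-attention pooling both map an utterance and its duplicated copy to the same vector, whereas sum pooling keeps them apart.

```python
import math

def mean_pool(frames):
    """Average the frame vectors over time."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def sum_pool(frames):
    """Sum the frame vectors over time."""
    return [sum(col) for col in zip(*frames)]

def attention_pool(frames, w):
    """Softmax-weighted average; scores are dot products with w (hypothetical)."""
    scores = [sum(x * wi for x, wi in zip(f, w)) for f in frames]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    return [sum(wt * f[d] for wt, f in zip(weights, frames)) for d in range(dim)]

def close(u, v, tol=1e-9):
    """Element-wise comparison with a small tolerance."""
    return all(abs(x - y) <= tol for x, y in zip(u, v))

# Toy frame-level features (assumed values, not real wav2vec 2.0 outputs).
utt_a = [[1.0, 0.0], [0.0, 1.0]]
utt_b = utt_a * 2          # same frames, each appearing twice
w = [0.3, -0.7]            # hypothetical attention scoring vector

mean_collides = close(mean_pool(utt_a), mean_pool(utt_b))
attn_collides = close(attention_pool(utt_a, w), attention_pool(utt_b, w))
sum_collides = close(sum_pool(utt_a), sum_pool(utt_b))
print(mean_collides, attn_collides, sum_collides)
```

The two normalized poolings collide because their weights always sum to one, so duplicating every frame leaves the weighted average unchanged; this is the kind of information loss that an injective aggregation is meant to avoid.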