Speaker diarization, the task of segmenting an audio recording based on speaker identity, is an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in isolation. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The embedding extractor is initialized using a pre-trained x-vector model, while the GNN model is initially trained using the x-vector embeddings from the pre-trained model. Finally, the E-SHARC model takes the front-end mel-filterbank features as input and jointly optimizes the embedding extractor and the GNN clustering module, performing representation learning, metric learning, and clustering with end-to-end optimization. Further, with additional inputs from an external overlap detector, the E-SHARC approach is capable of predicting the speakers in overlapping speech regions. The experimental evaluation on benchmark datasets such as AMI, VoxConverse, and DISPLACE shows that the proposed E-SHARC framework provides competitive diarization results using graph-based clustering methods.
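To make the idea of hierarchical clustering over a graph of segment embeddings more concrete, the following is a minimal sketch and not the E-SHARC model itself. It assumes the segment embeddings are already available as a NumPy array, and it replaces the trained GNN's edge predictions with a plain cosine-similarity threshold. The function names (`hierarchical_graph_clustering`, `connected_components`) and the parameters `k`, `threshold`, and `max_levels` are hypothetical choices made for illustration only.

```python
import numpy as np


def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x (rows must be non-zero)."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T


def connected_components(num_nodes, edges):
    """Label connected components with a small union-find."""
    parent = list(range(num_nodes))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    roots = [find(i) for i in range(num_nodes)]
    # Relabel roots as 0..C-1 in order of first appearance.
    relabel = {r: idx for idx, r in enumerate(dict.fromkeys(roots))}
    return np.array([relabel[r] for r in roots])


def hierarchical_graph_clustering(embeddings, k=3, threshold=0.5, max_levels=10):
    """Toy multi-level clustering: k-NN graph -> keep strong edges -> merge."""
    nodes = np.asarray(embeddings, dtype=float)
    labels = np.arange(len(nodes))  # each segment starts as its own cluster
    for _ in range(max_levels):
        n = len(nodes)
        if n < 2:
            break
        sim = cosine_similarity_matrix(nodes)
        np.fill_diagonal(sim, -np.inf)  # exclude self-edges
        # ASSUMPTION: a cosine threshold stands in for the edge scores that
        # a trained GNN would predict in a supervised system.
        edges = []
        for i in range(n):
            for j in np.argsort(sim[i])[::-1][: min(k, n - 1)]:
                if sim[i, j] > threshold:
                    edges.append((i, int(j)))
        comp = connected_components(n, edges)
        num_clusters = int(comp.max()) + 1
        if num_clusters == n:  # nothing merged at this level, so stop
            break
        # Merge each component into a single node (mean of its members).
        nodes = np.stack([nodes[comp == c].mean(axis=0) for c in range(num_clusters)])
        labels = comp[labels]  # propagate the merge to the original segments
    return labels
```

Calling `hierarchical_graph_clustering(emb)` on an array of shape `(num_segments, dim)` returns one integer cluster label per segment. The stopping rule here (no merges at a level) is a simplification; a learned system would rely on its trained edge scores rather than a fixed threshold.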