Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), an approach known as EEND-VC, has gained interest because it leverages the strengths of both methods. EEND-VC estimates speech activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across chunks. EEND-VC thus generates multiple streams of embeddings, one for each speaker in a chunk. These embeddings can be clustered using constrained agglomerative hierarchical clustering (cAHC), which ensures that embeddings from the same chunk are assigned to different clusters. This paper introduces an alternative clustering approach, a multi-stream extension of the successful Bayesian HMM clustering of x-vectors (VBx), called MS-VBx. Experiments on three datasets demonstrate that MS-VBx outperforms cAHC in both diarization and speaker counting performance.
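
To make the cannot-link constraint concrete, the following is a minimal sketch of constrained AHC, not the implementation evaluated in the paper. The function name `constrained_ahc`, the use of cosine distance, average linkage, and a distance threshold as stopping criterion are all assumptions made for illustration; the only property taken from the text is that embeddings from the same chunk never end up in the same cluster.

```python
import numpy as np


def constrained_ahc(embeddings, chunk_ids, threshold):
    """Greedy average-linkage AHC with cannot-link constraints (illustrative).

    embeddings: (N, D) array, one row per (chunk, local speaker) embedding.
    chunk_ids:  length-N sequence; embeddings sharing a chunk id are never merged.
    threshold:  assumed stopping criterion; merging stops once the closest
                allowed pair of clusters exceeds this cosine distance.
    Returns a length-N integer array of cluster labels.
    """
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    dist = 1.0 - x @ x.T  # pairwise cosine distance between embeddings

    clusters = [[i] for i in range(len(x))]  # member indices per cluster
    chunks = [{c} for c in chunk_ids]        # chunk ids covered per cluster

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Cannot-link: two clusters that both contain an embedding
                # from the same chunk must stay separate.
                if chunks[a] & chunks[b]:
                    continue
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Stop when no merge is allowed or the best merge is too distant.
        if best is None or best[0] > threshold:
            break
        _, a, b = best
        clusters[a] += clusters[b]
        chunks[a] |= chunks[b]
        del clusters[b], chunks[b]

    labels = np.empty(len(x), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


# Toy example: two chunks, each with two local speakers.
toy_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
toy_chunk_ids = [0, 0, 1, 1]
toy_labels = constrained_ahc(toy_embeddings, toy_chunk_ids, threshold=0.5)
```

In the toy example, the first embedding of each chunk is merged into one cluster and the second embedding of each chunk into another, while the two embeddings within a chunk remain apart regardless of how similar they are. The pairwise search is quadratic per merge, which is acceptable for a sketch but not tuned for long recordings.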