In speaker diarisation, speaker embedding extraction models often suffer from a mismatch between their training loss functions and the speaker clustering method. In this paper, we propose spectral clustering-aware learning of embeddings (SCALE) to address this mismatch. Specifically, besides an angular prototypical (AP) loss, SCALE uses a novel affinity matrix loss which directly minimises the error between the affinity matrix estimated from speaker embeddings and the reference. SCALE also incorporates p-percentile thresholding and Gaussian blur, two important hyper-parameters of spectral clustering, into training. Experiments on the AMI dataset showed that speaker embeddings obtained with SCALE achieved over 50% relative speaker error rate reductions using oracle segmentation, and over 30% relative diarisation error rate reductions using automatic segmentation, when compared to a strong baseline with AP-loss-based speaker embeddings.
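As a rough illustration of the quantities named above, the following NumPy sketch computes an affinity matrix from embeddings, a reference affinity matrix from speaker labels, and the error between them. The abstract does not give the exact formulation, so every choice here is an assumption: cosine similarity as the affinity, mean squared error as the loss, row-wise percentile thresholding, and a separable Gaussian kernel. All function names are hypothetical, and the code only evaluates the forward computation rather than a trainable, differentiable loss.

```python
import numpy as np


def cosine_affinity(emb):
    """Pairwise cosine similarity between row-wise embeddings (assumed affinity)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return unit @ unit.T


def reference_affinity(labels):
    """Reference matrix: 1 where two segments share a speaker, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)


def percentile_threshold(aff, p=90.0):
    """Zero the entries below each row's p-th percentile (assumed row-wise variant)."""
    cutoff = np.percentile(aff, p, axis=1, keepdims=True)
    return np.where(aff >= cutoff, aff, 0.0)


def gaussian_blur(aff, sigma=1.0):
    """Smooth the matrix with a separable Gaussian kernel and edge padding."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()

    def smooth(v):
        padded = np.pad(v, radius, mode="edge")
        return np.convolve(padded, kernel, mode="valid")

    rows_smoothed = np.apply_along_axis(smooth, 1, aff)
    return np.apply_along_axis(smooth, 0, rows_smoothed)


def affinity_loss(emb, labels):
    """Mean squared error between estimated and reference affinity (assumed form)."""
    diff = cosine_affinity(emb) - reference_affinity(labels)
    return float(np.mean(diff ** 2))


# Toy example: three segments; the first two point in the same direction.
emb = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
loss_matching = affinity_loss(emb, [0, 0, 1])    # labels agree with the embeddings
loss_mismatch = affinity_loss(emb, [0, 1, 1])    # labels disagree with the embeddings
```

In this toy case the loss is zero when the labels match the geometry of the embeddings and grows when they do not, which is the behaviour an affinity-based objective is meant to reward.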