Since its introduction in 2019, the end-to-end neural diarization (EEND) line of work has addressed speaker diarization as a frame-wise multi-label classification problem with permutation-invariant training. Although EEND shows great promise, a few recent works have taken a step back and studied the possible combination of (local) supervised EEND diarization with (global) unsupervised clustering. Yet these hybrid contributions did not question the original multi-label formulation. We propose to switch from multi-label classification (where any two speakers can be active at the same time) to powerset multi-class classification (where dedicated classes are assigned to pairs of overlapping speakers). Through extensive experiments on 9 different benchmarks, we show that this formulation leads to significantly better performance (mostly on overlapping speech) and robustness to domain mismatch, while eliminating the detection threshold hyperparameter, which is critical for the multi-label formulation.
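To make the change of formulation concrete, the following is a minimal sketch of how frame-wise multi-label targets could be mapped to powerset classes and back. It is an illustration under stated assumptions, not the authors' implementation: the helper names, the choice of 3 speakers per chunk, and the cap of 2 simultaneous speakers are all hypothetical values picked for the example.

```python
from itertools import combinations


def build_powerset_classes(num_speakers, max_overlap=2):
    """List every speaker subset of size 0..max_overlap, smallest subsets first.

    Assumption: num_speakers and max_overlap are illustrative parameters.
    """
    classes = []
    for size in range(max_overlap + 1):
        classes.extend(combinations(range(num_speakers), size))
    return classes


def multilabel_to_powerset(frame, classes):
    """Map one multi-label frame (e.g. [1, 0, 1]) to a single class index."""
    active = tuple(i for i, flag in enumerate(frame) if flag)
    # Raises ValueError if more speakers are active than max_overlap allows.
    return classes.index(active)


def powerset_to_multilabel(class_index, classes, num_speakers):
    """Map a class index back to a binary speaker-activity vector."""
    active = classes[class_index]
    return [1 if i in active else 0 for i in range(num_speakers)]


def argmax(scores):
    """Index of the highest score; decoding needs no detection threshold."""
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical setting: 3 speakers, at most 2 speaking at once.
classes = build_powerset_classes(num_speakers=3, max_overlap=2)

# Speakers 0 and 2 overlap in this frame, so it gets a dedicated pair class.
target = multilabel_to_powerset([1, 0, 1], classes)

# Made-up per-class scores for one frame, decoded by a plain argmax.
scores = [0.10, 0.20, 0.05, 0.00, 0.60, 0.03, 0.02]
decoded = powerset_to_multilabel(argmax(scores), classes, num_speakers=3)
```

With these assumed settings there are 7 classes: non-speech, 3 single speakers, and 3 speaker pairs. Because each frame is assigned exactly one class, decoding reduces to an argmax over class scores, which is why the per-speaker detection threshold of the multi-label formulation is no longer needed.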