End-to-end neural diarization models have usually relied on a multilabel-classification formulation of the speaker diarization problem. Recently, we proposed a powerset multiclass formulation that has outperformed the state of the art on multiple datasets. In this paper, we study the calibration of a powerset speaker diarization model and explore some of its uses. We examine calibration both in-domain and out-of-domain, and analyze the data found in low-confidence regions. The reliability of model confidence is then tested in practice: we use the confidence of the pretrained model to selectively create training and validation subsets from unannotated data, and compare this to random selection. We find that top-label confidence can be used to reliably predict high-error regions. Moreover, training on low-confidence regions yields a better-calibrated model, and validating on low-confidence regions can be more annotation-efficient than validating on randomly selected regions.
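The selection procedure described above rests on top-label confidence: for a powerset model, each frame receives a probability distribution over speaker subsets, and the confidence is simply the largest class probability. A minimal sketch of ranking frames by this confidence and picking the least confident ones for annotation follows; the array shapes and the 10% selection budget are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Illustrative sketch (not the paper's implementation): per-frame logits
# over 7 powerset classes, e.g. {none, spk1, spk2, spk3, spk1+2, spk1+3, spk2+3}.
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 7))  # 1000 frames x 7 powerset classes

# Softmax over classes, then top-label confidence = max class probability.
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
confidence = probs.max(axis=1)

# Select the 10% least confident frames as candidates for annotation
# (assumed budget; the paper compares such subsets to random selection).
k = int(0.1 * len(confidence))
low_conf_idx = np.argsort(confidence)[:k]
```

Sorting by confidence rather than thresholding keeps the annotation budget fixed, which makes the comparison against an equally sized random subset straightforward.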