Multi-talker conversational automatic speech recognition (ASR) data are often used to train speaker diarization models. Because such data prioritize semantic continuity, speech segments include pauses and boundary margins, which makes the annotations loose. Models trained on such data tend to learn to reproduce this looseness, even though tight speech intervals are sometimes preferable for downstream applications. In this paper, we address the novel task of enabling models to produce tight predictions from loose labels. Our method generates tighter pseudo labels using causal and anticausal models, which are inherently incapable of learning loosening behavior. We further propose a co-training scheme that iteratively tightens the labels and updates both models for progressive refinement. Experimental results show that the proposed method recovers about 70% of the tightening effect achieved by ideal tight-label training and improves downstream performance.
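The pseudo-labeling idea can be illustrated with a minimal sketch. The abstract does not state how the causal and anticausal predictions are combined, so the rule below is an assumption: it supposes that a causal model (past context only) cannot anticipate a leading margin, that an anticausal model (future context only) cannot extend a trailing margin, and that a frame-wise intersection with the loose label therefore trims both ends. The function name `tighten_labels`, the threshold, and the toy posteriors are all hypothetical.

```python
import numpy as np

def tighten_labels(loose, causal_prob, anticausal_prob, threshold=0.5):
    """Return tighter frame-level pseudo labels.

    Assumed combination rule (not taken from the paper): keep a frame as
    active only if the loose label, the causal model, and the anticausal
    model all mark it as active. All inputs have shape (frames, speakers).
    """
    loose = np.asarray(loose, dtype=bool)
    causal = np.asarray(causal_prob) >= threshold
    anticausal = np.asarray(anticausal_prob) >= threshold
    return loose & causal & anticausal

# Toy example: one speaker, ten frames. The loose label spans frames 1-8.
loose = np.array([0, 1, 1, 1, 1, 1, 1, 1, 1, 0])[:, None]
# Hypothetical causal posterior: late (tight) onset, loose offset.
causal_prob = np.array([0, 0, 0, .9, .9, .9, .9, .9, .9, 0])[:, None]
# Hypothetical anticausal posterior: loose onset, early (tight) offset.
anticausal_prob = np.array([0, .9, .9, .9, .9, .9, .9, 0, 0, 0])[:, None]

tight = tighten_labels(loose, causal_prob, anticausal_prob)
print(tight[:, 0].astype(int).tolist())  # [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]
```

Under the co-training scheme described above, a step of this kind would presumably be repeated: both models are retrained on the tightened labels and then used to tighten them again. The stopping criterion and training details are not given in the abstract and are left out of the sketch.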