Strong representations of target speakers can help extract important information about speakers and detect corresponding temporal regions in multi-speaker conversations. In this study, we propose a neural architecture that simultaneously extracts speaker representations consistent with the speaker diarization objective and detects the presence of each speaker on a frame-by-frame basis, regardless of the number of speakers in a conversation. A speaker representation (called z-vector) extractor and a time-speaker contextualizer, implemented by a residual network and processing data in both temporal and speaker dimensions, are integrated into a unified framework. Tests on the CALLHOME corpus show that our model outperforms most of the methods proposed so far. Evaluations in a more challenging setting, with 2 to 7 simultaneous speakers, show that our model achieves relative diarization error rate reductions of 6.4% to 30.9% over several typical baselines.
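For reference, the relative reductions quoted in the abstract are presumably computed in the conventional way. Assuming that definition (the symbols $\mathrm{DER}_{\text{base}}$ and $\mathrm{DER}_{\text{model}}$ are introduced here only for illustration and do not appear in the original), the relative diarization error rate reduction over a given baseline is

```latex
\[
  \Delta_{\text{rel}}
  = \frac{\mathrm{DER}_{\text{base}} - \mathrm{DER}_{\text{model}}}
         {\mathrm{DER}_{\text{base}}} \times 100\%
\]
```

Under this reading, a 30.9% relative reduction means the model's DER is 69.1% of that baseline's DER. It does not mean a drop of 30.9 absolute percentage points.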