This paper proposes an online target speaker voice activity detection system for speaker diarization that does not require a clustering-based diarization system to supply the target speaker embeddings in advance. By adapting conventional target speaker voice activity detection for real-time operation, the framework identifies speaker activities using self-generated embeddings, yielding consistent performance without permutation inconsistencies during inference. During inference, a front-end model extracts frame-level speaker embeddings for each incoming block of the signal. Next, we predict the detection state of each speaker from these frame-level speaker embeddings and the previously estimated target speaker embeddings. The target speaker embeddings are then updated by aggregating the frame-level speaker embeddings according to the predictions for the current block. The model repeats this prediction and update for each block until it reaches the end of the signal. Experimental results show that the proposed method outperforms the offline clustering-based diarization system on the DIHARD III and AliMeeting datasets. The method is further extended to multi-channel data, where it achieves performance comparable to state-of-the-art offline diarization systems.
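The block-wise inference loop described above can be illustrated with a minimal sketch. It is not the paper's model: L2 normalization stands in for the neural front-end, a thresholded cosine similarity stands in for the detection back-end, and a running mean stands in for the embedding aggregation. The rule that enrolls unclaimed frames into the first empty speaker slot, the assumption that every frame contains speech, and all names (`online_ts_vad`, `detect_block`, `threshold`) are hypothetical choices made only for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def detect_block(frame_emb, targets, counts, threshold):
    """Toy stand-in for the detection back-end (assumption, not the paper's network).

    A frame is assigned to an enrolled speaker when its cosine similarity
    to that speaker's target embedding exceeds `threshold`.
    Returns a boolean matrix of shape (frames, speakers).
    """
    decisions = np.zeros((frame_emb.shape[0], targets.shape[0]), dtype=bool)
    active = counts > 0  # slots that already hold a target embedding
    if active.any():
        sims = l2_normalize(frame_emb) @ l2_normalize(targets[active]).T
        decisions[:, active] = sims > threshold
    return decisions


def online_ts_vad(blocks, num_speakers, threshold=0.7):
    """Process blocks in order, predicting activity and updating targets."""
    dim = blocks[0].shape[1]
    targets = np.zeros((num_speakers, dim))     # self-generated target embeddings
    counts = np.zeros(num_speakers, dtype=int)  # frames aggregated per speaker
    outputs = []
    for block in blocks:
        # 1) Front-end: frame-level embeddings (here just normalized features).
        frame_emb = l2_normalize(block)
        # 2) Detection from frame embeddings and previously estimated targets.
        decisions = detect_block(frame_emb, targets, counts, threshold)
        # Toy enrollment rule: unclaimed frames open the first empty slot.
        unclaimed = ~decisions.any(axis=1)
        empty = np.flatnonzero(counts == 0)
        if unclaimed.any() and empty.size > 0:
            decisions[unclaimed, empty[0]] = True
        # 3) Update each target as a running mean of its assigned frames.
        for s in range(num_speakers):
            n = int(decisions[:, s].sum())
            if n:
                total = targets[s] * counts[s] + frame_emb[decisions[:, s]].sum(axis=0)
                counts[s] += n
                targets[s] = total / counts[s]
        outputs.append(decisions)
    return np.concatenate(outputs, axis=0), targets
```

Because each speaker keeps the slot in which it was first enrolled, the output columns refer to the same speaker in every block, which is the property that avoids permutation inconsistencies. In this toy version, two new speakers who first appear in the same block would be merged into one slot; the sketch makes no attempt to handle that case.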