This paper describes our speaker diarization system for multi-domain, multi-microphone casual conversations. The proposed pipeline uses weighted prediction error (WPE)-based dereverberation as a front end, then applies end-to-end neural diarization with vector clustering (EEND-VC) to each channel separately. The per-channel diarization results are then integrated using diarization output voting error reduction plus overlap (DOVER-LAP). To exploit knowledge of the target domain and the results integrated across all channels, we apply self-supervised adaptation to each session by retraining EEND-VC with pseudo-labels derived from the DOVER-LAP output. The proposed system was incorporated into NTT's submission for the distant automatic speech recognition task of the CHiME-7 challenge. Compared to the organizer-provided VC-based baseline diarization system, our system achieved relative improvements of 65% and 62% on the development and evaluation sets, respectively, securing third place in diarization performance.
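To illustrate the channel-integration step, the following is a minimal sketch of fusing per-channel diarization outputs by a per-frame majority vote. It is a deliberately simplified stand-in, not DOVER-LAP itself and not the authors' implementation: it assumes speaker indices are already aligned across channels and gives every channel equal weight. The function name `vote_across_channels` and the toy data are hypothetical.

```python
from typing import List


def vote_across_channels(channel_activity: List[List[List[int]]]) -> List[List[int]]:
    """Fuse per-channel speaker activity by a per-frame majority vote.

    channel_activity[c][t][s] is 1 if channel c marks speaker s as active
    at frame t, and 0 otherwise.

    Assumption (not from the paper): speaker indices are already aligned
    across channels and all channels are weighted equally.
    """
    n_channels = len(channel_activity)
    n_frames = len(channel_activity[0])
    n_speakers = len(channel_activity[0][0])

    fused = []
    for t in range(n_frames):
        frame = []
        for s in range(n_speakers):
            votes = sum(channel_activity[c][t][s] for c in range(n_channels))
            # A speaker is kept when a strict majority of channels agree.
            # Speakers are voted on independently, so overlapping speech
            # (several active speakers in one frame) can survive fusion.
            frame.append(1 if 2 * votes > n_channels else 0)
        fused.append(frame)
    return fused


# Toy example: 3 channels, 3 frames, 2 speakers.
channels = [
    [[1, 0], [1, 1], [0, 1]],
    [[1, 0], [1, 0], [0, 1]],
    [[0, 0], [1, 1], [1, 1]],
]
fused = vote_across_channels(channels)
print(fused)
```

In a self-supervised adaptation loop such as the one described above, a fused output of this kind would play the role of the pseudo-labels used to retrain the per-channel model for that session.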