Voice Activity Detection (VAD) and Overlapped Speech Detection (OSD) are key pre-processing tasks for speaker diarization. In the meeting context, it is often easier to capture speech with a distant device, but distant capture leads to severe performance degradation. We study a unified supervised learning framework for joint VAD and OSD (VAD+OSD) on distant multi-microphone recordings. We investigate several multi-channel VAD+OSD front-ends that weight and combine the incoming channels, and propose three algorithms based on the Self-Attention Channel Combinator (SACC) previously proposed in the literature. Experiments on the AMI meeting corpus show that channel combination approaches bring significant VAD+OSD improvements in the distant speech scenario. In particular, we explore learned complex-valued combination weights and demonstrate the benefits of this approach in terms of explainability. The channel combination-based VAD+OSD systems are also evaluated on the final back-end task, speaker diarization, where they yield significant improvements. Finally, because multi-channel systems are trained for a fixed array configuration, they may fail to generalize to other array set-ups, e.g. a mismatched number of microphones. We propose a channel-number invariant loss to learn a unique feature representation regardless of the number of available microphones. Evaluation on mismatched array configurations highlights the robustness of this training strategy.
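To make the idea of a front-end that weights and combines incoming channels concrete, the following is a minimal NumPy sketch, not the SACC architecture or any of the three proposed algorithms. It assumes real-valued features of shape (channels, frames, bins) and a hypothetical learned projection `score_vector`; the function name `combine_channels`, the array sizes, and the random inputs are all illustrative assumptions. The learned complex-valued weights discussed in the abstract are not modelled here.

```python
import numpy as np


def combine_channels(features, score_vector):
    """Weight and sum channels using per-frame softmax weights.

    features: array of shape (channels, frames, bins), e.g. log-magnitude spectra.
    score_vector: array of shape (bins,), a hypothetical learned projection.
    Returns (combined, weights) with shapes (frames, bins) and (channels, frames).
    """
    # One scalar score per channel and frame.
    scores = features @ score_vector  # (channels, frames)
    # Softmax across the channel axis, shifted for numerical stability.
    scores = scores - scores.max(axis=0, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=0, keepdims=True)
    # Weighted sum over channels yields a single-channel representation.
    combined = (weights[:, :, None] * features).sum(axis=0)
    return combined, weights


# Illustrative sizes only: 8 channels, 100 frames, 64 feature bins.
rng = np.random.default_rng(0)
features = rng.standard_normal((8, 100, 64))
score_vector = rng.standard_normal(64)
combined, weights = combine_channels(features, score_vector)
```

Because the softmax runs over the channel axis, the same function accepts any number of input channels and always returns one combined stream. That says nothing about whether a trained system remains accurate under such a mismatch, which is the problem the channel-number invariant loss is meant to address.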