A robust multichannel speaker diarization and separation system is proposed that exploits the spatio-temporal activity of the speakers. The system is realized in a hybrid architecture that combines array signal processing units with deep learning units. For speaker diarization, a spatial coherence matrix across time frames is computed from the whitened relative transfer functions (wRTFs) of the microphone array. This matrix serves as a robust feature for subsequent machine learning and requires no prior knowledge of the array configuration. A computationally efficient Spatial Activity-driven Speaker Diarization network (SASDnet) is constructed to estimate speaker activity directly from the spatial coherence matrix. For speaker separation, we propose the Global and Local Activity-driven Speaker Extraction network (GLASEnet), which separates speaker signals via speaker-specific global and local spatial activity functions. The local spatial activity functions depend on the coherence between the wRTF of each time-frequency bin and the wRTFs of the target speaker-dominant bins. The global spatial activity functions are computed from the global spatial coherence functions, which are based on frequency-averaged local spatial activity functions. Experimental results demonstrate that the proposed system achieves superior speaker diarization, counting, and separation performance at low computational complexity compared with the selected baselines.
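As a rough illustration of the spatial coherence matrix described above, the following NumPy sketch computes a frame-by-frame coherence matrix from a multichannel STFT. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the RTF estimator or the whitening rule, so the instantaneous per-bin ratio to a reference microphone and the unit-norm scaling used here are assumptions, and the function names `whitened_rtf` and `spatial_coherence_matrix` are hypothetical.

```python
import numpy as np


def whitened_rtf(stft, ref=0, eps=1e-8):
    """Return unit-norm RTF vectors for every time-frequency bin.

    stft: complex array of shape (mics, freqs, frames).
    Assumption: the RTF is estimated as the instantaneous ratio of each
    channel to the reference channel, and "whitening" is taken to mean
    scaling each per-bin vector to unit norm across microphones.
    """
    rtf = stft / (stft[ref:ref + 1] + eps)
    norm = np.linalg.norm(rtf, axis=0, keepdims=True)
    return rtf / (norm + eps)


def spatial_coherence_matrix(wrtf):
    """Return a real (frames, frames) matrix of frequency-averaged coherence.

    Entry (t, s) is the real part of the inner product between the wRTF
    vectors of frames t and s, averaged over frequency bins.
    """
    # coh[f, t, s] = sum over mics of conj(w[m, f, t]) * w[m, f, s]
    coh = np.einsum("mft,mfs->fts", wrtf.conj(), wrtf)
    return coh.real.mean(axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic 4-microphone STFT with 16 frequency bins and 10 frames.
    shape = (4, 16, 10)
    stft = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
    C = spatial_coherence_matrix(whitened_rtf(stft))
    print(C.shape)
```

Because only ratios between channels enter the computation, the sketch needs no microphone positions, which is consistent with the abstract's claim that the feature works without prior knowledge of the array configuration. Frames dominated by the same spatially fixed speaker would be expected to give entries near 1, while frames dominated by different speakers give lower values.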