We propose a diarization system that estimates "who spoke when" from spatial information and serves as the front-end of a meeting transcription system operating on signals recorded by an acoustic sensor network (ASN). Although the spatial distribution of the microphones is advantageous, exploiting this spatial diversity for diarization and signal enhancement is challenging because the microphone positions are typically unknown and the recorded signals are, in general, initially unsynchronized. We address these issues by first blindly synchronizing the signals and then estimating time differences of arrival (TDOAs). The TDOA information is used to estimate speaker activity, even when multiple speakers are active simultaneously. This speaker activity information guides a spatial mixture model, on the basis of which the individual speakers' signals are extracted via beamforming. Finally, the extracted signals are forwarded to a speech recognizer. In addition, we propose a novel initialization scheme for spatial mixture models based on the TDOA estimates. Experiments on real recordings from the LibriWASN data set show that the proposed system has an advantage over a system whose spatial mixture model does not use external diarization information.
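To make the TDOA estimation step concrete, the following is a minimal sketch of a pairwise delay estimator for two already synchronized channels. It uses the generalized cross-correlation with phase transform (GCC-PHAT), a widely used TDOA estimator; the abstract does not state which estimator the proposed system employs, so this choice is an assumption. The function name `gcc_phat_tdoa` and its parameters are hypothetical and serve illustration only.

```python
import numpy as np


def gcc_phat_tdoa(x, y, fs, max_tau=None):
    """Estimate the delay of y relative to x in seconds (positive if y lags x).

    Illustrative GCC-PHAT sketch; not necessarily the estimator of the
    system described above.
    """
    # Zero-pad so that the circular correlation equals a linear one.
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)

    # Cross-power spectrum with PHAT weighting: keep only the phase.
    cross = Y * np.conj(X)
    cross /= np.abs(cross) + 1e-12
    cc = np.fft.irfft(cross, n=n)

    # Restrict the search to physically plausible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    lag = int(np.argmax(cc)) - max_shift
    return lag / fs


if __name__ == "__main__":
    # Synthetic check: y is x delayed by 12 samples at 16 kHz.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(4096)
    y = np.concatenate((np.zeros(12), x[:-12]))
    print(gcc_phat_tdoa(x, y, fs=16000) * 16000)
```

In a diarization front-end of the kind described, such an estimate would be computed per short time frame and per microphone pair, so that frames whose TDOAs cluster around a speaker-specific value can be attributed to that speaker.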