We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination Word Error Rate (ORC WER). Then, a d-vector-based diarization module is employed to extract speaker embeddings from the enhanced signals and to assign the CSS outputs to the correct speaker. Here, we propose a syntactically informed diarization using sentence- and word-level boundaries of the ASR module to support speaker turn detection. This results in a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.
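As an illustration of the final metric, the following is a minimal sketch of cpWER as it is commonly defined: all utterances of each speaker are concatenated, every assignment of hypothesis speakers to reference speakers is scored with a word-level edit distance, and the assignment with the fewest errors is normalized by the total number of reference words. The function names `edit_distance` and `cp_wer`, the dictionary input format, and the brute-force permutation search are assumptions made for this example. They are not taken from the evaluated pipeline, and practical toolkits typically replace the permutation search with a linear assignment solver.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cp_wer(references, hypotheses):
    """Compute cpWER from dicts mapping a speaker label to a list of utterances.

    Hypothetical input format chosen for this sketch; speaker labels in the
    two dicts do not need to match.
    """
    # Concatenate all utterances of each speaker into one word sequence.
    refs = [" ".join(utts).split() for utts in references.values()]
    hyps = [" ".join(utts).split() for utts in hypotheses.values()]

    # Pad the shorter side with empty streams so speaker counts match.
    n = max(len(refs), len(hyps))
    refs = refs + [[] for _ in range(n - len(refs))]
    hyps = hyps + [[] for _ in range(n - len(hyps))]

    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Brute-force search over speaker permutations (fine for few speakers).
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    references = {
        "alice": ["hello world", "how are you"],
        "bob": ["fine thanks"],
    }
    hypotheses = {
        "spk0": ["fine thanks"],
        "spk1": ["hello word how are you"],
    }
    print(f"cpWER = {cp_wer(references, hypotheses):.3f}")
```

Because the metric scores the best speaker permutation, a system is penalized for assigning words to the wrong speaker but not for the arbitrary labels it gives to speakers.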