We propose a modular pipeline for the single-channel separation, recognition, and diarization of meeting-style recordings and evaluate it on the Libri-CSS dataset. Using a Continuous Speech Separation (CSS) system with a TF-GridNet separation architecture, followed by a speaker-agnostic speech recognizer, we achieve state-of-the-art recognition performance in terms of Optimal Reference Combination Word Error Rate (ORC WER). A d-vector-based diarization module then extracts speaker embeddings from the enhanced signals and assigns the CSS outputs to the correct speaker. Here, we propose a syntactically informed diarization that uses the sentence- and word-level boundaries provided by the ASR module to support speaker turn detection. This results in a state-of-the-art Concatenated minimum-Permutation Word Error Rate (cpWER) for the full meeting recognition pipeline.
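To make the cpWER metric concrete, the following minimal Python sketch shows one way it can be computed: the utterances of each speaker are concatenated, every assignment of hypothesis speakers to reference speakers is scored by word-level edit distance, and the lowest total error count is divided by the number of reference words. The function names `edit_distance` and `cp_wer`, the dictionary-based input format, and the brute-force permutation search are illustrative assumptions and not the evaluation code used in this work; for many speakers, an assignment solver such as the Hungarian algorithm would scale better.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def cp_wer(reference, hypothesis):
    """Sketch of cpWER (hypothetical helper, not the paper's scoring tool).

    reference, hypothesis: dict mapping a speaker label to that speaker's
    utterances (list of strings) in temporal order.
    """
    # Concatenate all utterances of each speaker into one word stream.
    ref_streams = [" ".join(utts).split() for utts in reference.values()]
    hyp_streams = [" ".join(utts).split() for utts in hypothesis.values()]

    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(ref_streams), len(hyp_streams))
    ref_streams += [[] for _ in range(n - len(ref_streams))]
    hyp_streams += [[] for _ in range(n - len(hyp_streams))]

    total_ref_words = sum(len(r) for r in ref_streams)
    if total_ref_words == 0:
        raise ValueError("reference contains no words")

    # Brute-force search over speaker permutations (fine for a few speakers).
    best_errors = min(
        sum(edit_distance(r, hyp_streams[p]) for r, p in zip(ref_streams, perm))
        for perm in permutations(range(n))
    )
    return best_errors / total_ref_words


if __name__ == "__main__":
    ref = {"A": ["hello world", "how are you"], "B": ["fine thanks"]}
    hyp = {"spk1": ["fine thanks"], "spk2": ["hello word", "how are you"]}
    print(cp_wer(ref, hyp))  # one substitution over seven reference words
```

Because the metric scores speaker-wise concatenated transcripts, a word that is recognized correctly but attributed to the wrong speaker counts as an error, which is why cpWER reflects diarization quality in addition to recognition quality.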