Past studies on end-to-end meeting transcription have focused on model architecture and have mostly been evaluated on simulated meeting data. We present a novel study that aims to optimize the use of a Speaker-Attributed ASR (SA-ASR) system in real-life scenarios, such as those in the AMI meeting corpus, in order to improve the speaker assignment of speech segments. First, we propose a pipeline tailored to real-life applications that combines Voice Activity Detection (VAD), Speaker Diarization (SD), and SA-ASR. Second, we advocate fine-tuning the SA-ASR model on VAD output segments, since the model is also applied to VAD segments at test time, and show that this yields a relative Speaker Error Rate (SER) reduction of up to 28%. Finally, we explore strategies to improve the extraction of the speaker embedding templates that the SA-ASR system takes as inputs. We show that extracting them from SD output rather than from annotated speaker segments yields a relative SER reduction of up to 20%.
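The reported gains are relative reductions, that is, the drop in SER expressed as a fraction of the baseline SER rather than as a difference in percentage points. The minimal sketch below shows that arithmetic; the function name `relative_reduction` and the SER values are illustrative assumptions and are not figures from the study.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the relative reduction from `baseline` to `improved`, in percent.

    Both arguments are error rates on the same scale (e.g. SER in %).
    """
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical SER values, chosen only to illustrate the computation.
    baseline_ser = 10.0  # assumed SER (%) before a change
    improved_ser = 7.2   # assumed SER (%) after a change
    reduction = relative_reduction(baseline_ser, improved_ser)
    print(f"relative SER reduction: {reduction:.1f}%")
```

With these assumed values the absolute drop is 2.8 percentage points, while the relative reduction is about 28% of the baseline, which is the kind of figure the abstract quotes.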