This paper presents the CHiME-8 DASR challenge, which carries on from the previous edition, CHiME-7 DASR (C7DASR), and the earlier CHiME-6 challenge. It focuses on joint multi-channel distant automatic speech recognition (DASR) and diarization with one or more, possibly heterogeneous, devices. The main goal is to spur research towards meeting transcription approaches that can generalize across an arbitrary number of speakers, diverse settings (formal vs. informal conversations), meeting durations, a wide variety of acoustic scenarios, and different recording configurations. Novelties with respect to C7DASR include: i) the addition of NOTSOFAR-1, a new office/corporate meeting scenario; ii) a manually corrected Mixer 6 development set; iii) a new track in which we allow the use of large language models (LLMs); and iv) a jury award mechanism to encourage participants to also explore more practical and innovative solutions. To lower the entry barrier for participants, we provide a standalone toolkit for downloading and preparing the datasets, as well as for performing text normalization and scoring submissions. Furthermore, this year we also provide two baseline systems: one directly inherited from C7DASR and based on ESPnet, and another developed in NeMo and based on the NeMo team's submission to last year's C7DASR. Baseline system results suggest that the addition of the NOTSOFAR-1 scenario significantly increases the task's difficulty due to its high number of speakers and very short meeting duration.