The Multi-modal Information based Speech Processing (MISP) challenge aims to extend the application of signal processing technology in specific scenarios by promoting research into wake-up words, speaker diarization, speech recognition, and other technologies. The MISP2022 challenge has two tracks: 1) audio-visual speaker diarization (AVSD), which aims to solve ``who spoke when'' using both audio and visual data; 2) a novel audio-visual diarization and recognition (AVDR) task, which addresses ``who spoke what when'' using audio-visual speaker diarization results. Both tracks focus on the Chinese language and use far-field audio and video recorded in real home-TV scenarios: 2-6 people communicating with each other with TV noise in the background. This paper introduces the dataset, track settings, and baselines of the MISP2022 challenge. Our analyses of experiments and examples indicate that the AVDR baseline system performs well, and they point to potential difficulties in this challenge due to, e.g., the far-field video quality, the presence of TV noise in the background, and indistinguishable speakers.