In short videos and live broadcasts, speech, singing voice, and background music often overlap and obscure one another. This overlap makes the audio content difficult to structure and recognize, which can impair downstream automatic speech recognition (ASR) and music understanding applications. This paper proposes JRSV, an ASR model based on multi-task audio source separation (MTASS) that Jointly Recognizes Speech and singing Voices. Specifically, the MTASS module separates the mixed audio into distinct speech and singing voice tracks while removing the background music, and a hybrid CTC/attention recognition module then transcribes both tracks. Online distillation is further proposed to improve recognition robustness. To evaluate the proposed methods, a benchmark dataset is constructed and released. Experimental results demonstrate that JRSV significantly improves recognition accuracy on each track of the mixed audio.
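The separate-then-recognize flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the JRSV implementation: the names `Tracks`, `hybrid_score`, `pick_best`, and `jointly_recognize` are hypothetical, the separator and recognizer are passed in as plain callables, and the log-linear interpolation of CTC and attention scores with a weight `ctc_weight` is the common hybrid CTC/attention formulation, which the abstract does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Waveform = List[float]  # stand-in for an audio signal (assumption)


@dataclass
class Tracks:
    """Output of a hypothetical MTASS-style separator.

    Background music is assumed to be removed, so only two tracks remain.
    """
    speech: Waveform
    singing: Waveform


def hybrid_score(ctc_logp: float, att_logp: float, ctc_weight: float = 0.3) -> float:
    """Interpolate CTC and attention log-probabilities for one hypothesis.

    The default weight of 0.3 is an illustrative value, not taken from the paper.
    """
    return ctc_weight * ctc_logp + (1.0 - ctc_weight) * att_logp


def pick_best(hyps: Sequence[dict], ctc_weight: float = 0.3) -> str:
    """Return the text of the hypothesis with the highest hybrid score."""
    best = max(hyps, key=lambda h: hybrid_score(h["ctc"], h["att"], ctc_weight))
    return best["text"]


def jointly_recognize(
    mixture: Waveform,
    separate: Callable[[Waveform], Tracks],
    recognize: Callable[[Waveform], str],
) -> Dict[str, str]:
    """Separate the mixture, then transcribe the speech and singing tracks."""
    tracks = separate(mixture)
    return {
        "speech": recognize(tracks.speech),
        "singing": recognize(tracks.singing),
    }
```

Passing the separator and recognizer as arguments keeps the sketch independent of any particular model. In the system described above, both tracks are handled by the same recognition module, which is why a single `recognize` callable is used here.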