Self-supervised speech pre-training methods have developed rapidly in recent years and have proven highly effective for many near-field single-channel speech tasks. However, far-field multichannel speech processing still suffers from the scarcity of labeled multichannel data and from complex ambient noise. The efficacy of self-supervised learning for far-field multichannel and multi-modal speech processing has not been well explored. Considering that visual information helps to improve speech recognition performance in noisy scenes, in this work we propose AV-wav2vec2, a multichannel multi-modal speech self-supervised learning framework that takes video and multichannel audio data as inputs. First, we propose a multi-path structure that processes multichannel audio streams and a visual stream in parallel, with intra- and inter-channel contrastive losses as training targets to fully exploit the spatiotemporal information in multichannel speech data. Second, building on contrastive learning, we jointly train with additional single-channel audio data to improve the quality of the learned speech representations. Finally, we validate the effectiveness of the proposed method on a Chinese multichannel multi-modal dataset recorded in real-world scenarios, covering audio-visual speech recognition (AVSR), automatic speech recognition (ASR), visual speech recognition (VSR) and audio-visual speaker diarization (AVSD) tasks.
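To make the intra- and inter-channel contrastive objectives concrete, the following is a minimal NumPy sketch of an InfoNCE-style loss over multichannel features. It is an illustration only: the helper `info_nce`, the random features, and the negative-sampling scheme (other time steps of the same channel) are assumptions for this sketch, not the paper's actual implementation, which would operate on wav2vec 2.0-style contextual outputs and quantized targets.

```python
import numpy as np

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor frame (sketch)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    # Similarity of anchor to the positive and to each negative target.
    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / temperature
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0] + 1e-12)

# Toy stand-ins for contextual features and quantized targets:
# shape (channels, time, dim). Real features would come from the encoder.
rng = np.random.default_rng(0)
C, T, D = 4, 10, 8
context = rng.normal(size=(C, T, D))
targets = rng.normal(size=(C, T, D))

c, t = 0, 3
# Intra-channel: positive is the target of the SAME channel at the same
# time step; negatives are targets at other time steps of that channel.
neg_intra = [targets[c, j] for j in range(T) if j != t]
loss_intra = info_nce(context[c, t], targets[c, t], neg_intra)

# Inter-channel: positive is the target of ANOTHER channel at the same
# time step, encouraging cross-channel consistency; negatives again come
# from other time steps.
c2 = 1
neg_inter = [targets[c2, j] for j in range(T) if j != t]
loss_inter = info_nce(context[c, t], targets[c2, t], neg_inter)

total = loss_intra + loss_inter
```

Summing the two terms over all masked frames and channel pairs would give one plausible form of the combined training target; the actual weighting between the two losses is a design choice not specified here.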