The scarcity of labeled audio-visual datasets limits the training of high-performing audio-visual speaker diarization systems. To address this limitation, we leverage pre-trained supervised and self-supervised speech models. Specifically, we adopt supervised models~(ResNet and ECAPA-TDNN) and self-supervised models~(WavLM and HuBERT) as the speaker and audio embedding extractors in an end-to-end audio-visual speaker diarization~(AVSD) system. We then explore the effectiveness of different architectures, including Transformer, Conformer, and a cross-attention mechanism, in the audio-visual decoder. To mitigate the performance degradation caused by training the modules separately, we jointly train the audio encoder, speaker encoder, and audio-visual decoder of the AVSD system. Experiments on the MISP dataset demonstrate the effectiveness of the proposed method, which obtained third place in the MISP Challenge 2022.
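To illustrate the cross-attention mechanism named above as one option for the audio-visual decoder, the following is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is not the implementation used in the AVSD system: the function names, the choice of visual features as queries and audio embeddings as keys and values, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of query vectors (assumed here: per-frame visual features)
    keys, values: lists of vectors of equal length (assumed: audio embeddings)
    Learned projections are omitted to keep the sketch short.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        fused = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(out_dim)]
        outputs.append(fused)
    return outputs


# Toy inputs (hypothetical): 2 visual frames attend over 3 audio frames.
visual = [[1.0, 0.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(visual, audio, audio)
```

Each output vector is a convex combination of the audio embeddings, weighted by how closely each audio frame matches the corresponding visual frame. A practical decoder would add learned query, key, and value projections, multiple heads, and residual connections.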