Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a dual-scale attention model based on SepFormer that uses cross- and self-attention to fuse and model audio and visual features. AV-SepFormer splits the audio feature into a number of chunks equal to the length of the visual feature. Self- and cross-attention are then employed to model and fuse the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information both between and within chunks and provides significant gains over the traditional positional encoding. Our model has two key advantages: first, the time granularity of the chunked audio feature is synchronized to the visual feature, which mitigates the harm caused by the mismatch between audio and video sampling rates; second, by combining self- and cross-attention, the feature fusion and speech extraction processes are unified within a single attention paradigm. The experimental results show that AV-SepFormer significantly outperforms other existing methods.
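
The chunking step and the 2D positional encoding described above can be illustrated with a minimal NumPy sketch. It is not the AV-SepFormer implementation: the function names (`chunk_audio`, `sinusoidal`, `positional_encoding_2d`), the zero-padding of the final chunk, and the choice to sum an inter-chunk and an intra-chunk sinusoidal encoding are all assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np

def chunk_audio(audio_feat, num_visual_frames):
    """Split (T_a, D) audio features into one chunk per visual frame.

    Assumption: the tail is zero-padded so that T_a divides evenly.
    Returns an array of shape (num_visual_frames, chunk_len, D).
    """
    t_a, dim = audio_feat.shape
    chunk_len = -(-t_a // num_visual_frames)  # ceiling division
    pad = chunk_len * num_visual_frames - t_a
    padded = np.pad(audio_feat, ((0, pad), (0, 0)))  # zero-pad the time axis
    return padded.reshape(num_visual_frames, chunk_len, dim)

def sinusoidal(n_pos, dim):
    """Standard sinusoidal positional encoding, shape (n_pos, dim); dim must be even."""
    pos = np.arange(n_pos)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((n_pos, dim))
    enc[:, 0::2] = np.sin(pos * freq)
    enc[:, 1::2] = np.cos(pos * freq)
    return enc

def positional_encoding_2d(num_chunks, chunk_len, dim):
    """Hypothetical 2D encoding: inter-chunk plus intra-chunk position."""
    inter = sinusoidal(num_chunks, dim)[:, None, :]  # position of the chunk
    intra = sinusoidal(chunk_len, dim)[None, :, :]   # position inside the chunk
    return inter + intra  # broadcasts to (num_chunks, chunk_len, dim)

# Example: 1000 audio frames aligned to 25 visual frames, feature size 64.
audio_feat = np.random.randn(1000, 64)
chunks = chunk_audio(audio_feat, num_visual_frames=25)
chunks = chunks + positional_encoding_2d(*chunks.shape)
print(chunks.shape)
```

After chunking, the first axis has the same length as the visual feature sequence, so each visual frame can be paired with exactly one audio chunk when cross-attention is applied; self-attention can then operate within and across chunks.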