Audio-visual target speech extraction, which aims to extract a target speaker's speech from a noisy mixture by observing lip movements, has made significant progress by combining time-domain speech separation models with CNN-based visual feature extractors. One problem in fusing audio and video information is that the two modalities have different time resolutions. Most current research upsamples the visual features along the time dimension so that the audio and video features align in time. However, we believe that lip movements mostly carry long-term, or phone-level, information. Based on this assumption, we propose a new way to fuse audio-visual features. We observe that for DPRNN \cite{dprnn}, the time resolution of the inter-chunk dimension can be very close to the time resolution of the video frames. As in \cite{sepformer}, the LSTM in DPRNN is replaced by intra-chunk and inter-chunk self-attention, but in the proposed algorithm the inter-chunk attention incorporates the visual features as an additional feature stream. This avoids upsampling the visual cues and results in more efficient audio-visual fusion. Experimental results show that the proposed method outperforms other time-domain audio-visual fusion models.
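To make the time-resolution argument concrete, the following is a minimal sketch of how the inter-chunk axis of a DPRNN-style segmentation can line up with the video frame rate. All numeric settings (8 kHz audio, an encoder stride of 8 samples, chunks of 80 frames with a hop of 40, and 25 fps video) and all function names are illustrative assumptions, not values taken from the paper. The final concatenation is only one hypothetical way to expose the visual stream at the inter-chunk level; the proposed model does this inside inter-chunk attention, whose details are not given here.

```python
import numpy as np

def inter_chunk_rate(sample_rate, encoder_stride, chunk_hop):
    """Steps per second along the inter-chunk axis (hypothetical helper)."""
    frame_rate = sample_rate / encoder_stride  # encoder frames per second
    return frame_rate / chunk_hop              # chunks per second

def segment(features, chunk_size, chunk_hop):
    """Split (T, N) features into overlapping chunks of shape (S, chunk_size, N)."""
    T, N = features.shape
    n_chunks = int(np.ceil(max(T - chunk_size, 0) / chunk_hop)) + 1
    padded_len = (n_chunks - 1) * chunk_hop + chunk_size
    padded = np.zeros((padded_len, N), dtype=features.dtype)
    padded[:T] = features  # zero-pad the tail so the last chunk is full
    return np.stack([padded[i * chunk_hop : i * chunk_hop + chunk_size]
                     for i in range(n_chunks)])

# Assumed settings, chosen only for illustration.
SAMPLE_RATE, ENCODER_STRIDE = 8000, 8
CHUNK_SIZE, CHUNK_HOP = 80, 40
VIDEO_FPS, SECONDS = 25, 2

rate = inter_chunk_rate(SAMPLE_RATE, ENCODER_STRIDE, CHUNK_HOP)

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((SECONDS * SAMPLE_RATE // ENCODER_STRIDE, 64))
visual_feats = rng.standard_normal((SECONDS * VIDEO_FPS, 32))

chunks = segment(audio_feats, CHUNK_SIZE, CHUNK_HOP)  # (S, K, N)

# One summary vector per chunk; mean pooling is a stand-in, not the paper's method.
chunk_summary = chunks.mean(axis=1)                   # (S, N)

# The two streams differ by at most a frame or so; trim to the shorter one.
S = min(len(chunk_summary), len(visual_feats))
fused = np.concatenate([chunk_summary[:S], visual_feats[:S]], axis=-1)

print(rate, chunks.shape, visual_feats.shape, fused.shape)
```

Under these assumed settings, the inter-chunk axis advances at the same rate as the video, so the visual features can be paired with chunks directly, with only a one-step length mismatch to trim, and no upsampling of the visual stream.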