In speaker tracking research, integrating complementary multi-modal data is a crucial strategy for improving the accuracy and robustness of tracking systems. However, tracking with incomplete modalities remains challenging due to noisy observations caused by occlusion, acoustic noise, and sensor failures. In particular, when data are missing from multiple modalities, the performance of existing multi-modal fusion methods tends to degrade. To this end, we propose a Global-Local Distillation-based Tracker (GLDTracker) for robust audio-visual speaker tracking. GLDTracker is driven by a teacher-student distillation model that enables flexible fusion of incomplete information from each modality. The teacher network processes global signals captured by the camera and microphone arrays, while the student network handles local information subject to visual occlusion and missing audio channels. By transferring knowledge from the teacher to the student, the student network can better adapt to complex dynamic scenes with incomplete observations. Within the student network, a global feature reconstruction module based on a generative adversarial network is constructed to reconstruct global features from feature embeddings with missing local information. Furthermore, a multi-modal multi-level fusion attention mechanism is introduced to integrate the incomplete and reconstructed features, leveraging the complementarity and consistency of audio-visual and global-local features. Experimental results on the AV16.3 dataset demonstrate that the proposed GLDTracker outperforms existing state-of-the-art audio-visual trackers and achieves leading performance on both the standard and the incomplete-modality datasets, highlighting its superiority and robustness under complex conditions. Code and models will be made publicly available.
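The teacher-to-student knowledge transfer mentioned above is commonly realized as a soft-target distillation loss in the style of Hinton et al.; the abstract does not specify GLDTracker's exact objective, so the following is only a minimal generic sketch, with all function names and the temperature value chosen for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of raw scores."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened
    output distributions, scaled by T^2 as in standard distillation.
    Illustrative only; not the paper's actual training objective."""
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    return temperature ** 2 * sum(
        pi * math.log(pi / qi) for pi, qi in zip(p, q)
    )
```

In this formulation the loss is zero when the student matches the teacher exactly and grows as their softened distributions diverge, which is what drives the student network toward the teacher's behavior on complete, global observations.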