Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting, which causes previously learned audio-video relations to be forgotten. To tackle this problem, we propose a new continual audio-video pre-training method with two novel ideas: (1) Localized Patch Importance Scoring: we introduce a multimodal encoder to determine an importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we assess the correlation of the current patches with past steps to identify the patches that are highly correlated with them. Based on the results of these two ideas, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a relative performance gain of 3.69%p in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by ~45%.
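The probabilistic patch selection step can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two scores are combined, nor whether patches highly correlated with past steps are kept or dropped. The function names, the softmax combination, and the `past_weight` and `temperature` parameters are all assumptions made for illustration; the sign of `past_weight` decides whether past-correlated patches are favored or penalized.

```python
import math
import random


def selection_probabilities(importance, past_correlation,
                            past_weight=-1.0, temperature=1.0):
    """Turn two per-patch scores into a probability distribution over patches.

    ASSUMPTION: the scores are combined linearly and passed through a softmax.
    A negative past_weight penalizes patches correlated with past steps,
    a positive one favors them; the abstract does not specify which applies.
    """
    if len(importance) != len(past_correlation):
        raise ValueError("score lists must have the same length")
    logits = [(imp + past_weight * corr) / temperature
              for imp, corr in zip(importance, past_correlation)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_patches(probs, k, rng):
    """Sample up to k distinct patch indices, weighted by probs."""
    remaining = list(range(len(probs)))
    weights = list(probs)
    chosen = []
    for _ in range(min(k, len(remaining))):
        if sum(weights) <= 0:  # all remaining weights underflowed to zero
            weights = [1.0] * len(weights)
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return sorted(chosen)


# Hypothetical scores for four patches (illustrative values only).
importance = [0.9, 0.1, 0.5, 0.7]
past_correlation = [0.2, 0.8, 0.1, 0.6]
probs = selection_probabilities(importance, past_correlation)
kept = sample_patches(probs, k=2, rng=random.Random(0))
```

Sampling rather than taking a hard top-k keeps some chance of selecting lower-scored patches, which is one plausible reading of "probabilistic" selection here.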