Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting, in which previously learned audio-video relations are forgotten. To tackle this problem, we propose a new continual audio-video pre-training method built on two novel ideas: (1) Localized Patch Importance Scoring: we introduce a multimodal encoder that determines an importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we assess the correlation of the current patches with the past steps to identify the patches that exhibit high correlation with those steps. Based on the results of these two ideas, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a relative performance gain of 3.69%p in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by ~45%.
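As a rough illustration of the probabilistic patch selection step described above, the following sketch combines a per-patch importance score with a per-patch past-correlation score and samples a subset of patches. It is not the authors' implementation. The function name `select_patches`, the parameters `lam` and `temperature`, the softmax combination, and the choice to down-weight patches that correlate strongly with past steps are all assumptions made for illustration; the abstract does not specify how the two scores are combined. Setting `lam` to a negative value would favor such patches instead.

```python
import numpy as np


def select_patches(importance, past_corr, keep, lam=1.0, temperature=1.0, rng=None):
    """Sample `keep` patch indices without replacement.

    importance : 1-D scores, higher = more audio-video relevant (assumed input).
    past_corr  : 1-D scores, higher = more correlated with past steps (assumed input).
    lam        : hypothetical weight on the past-correlation penalty.
    """
    rng = np.random.default_rng() if rng is None else rng
    importance = np.asarray(importance, dtype=float)
    past_corr = np.asarray(past_corr, dtype=float)
    if importance.ndim != 1 or importance.shape != past_corr.shape:
        raise ValueError("importance and past_corr must be 1-D arrays of equal length")
    if not 0 < keep <= importance.size:
        raise ValueError("keep must be between 1 and the number of patches")

    # Assumed combination rule: reward importance, penalize past correlation.
    logits = (importance - lam * past_corr) / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()

    # Probabilistic (not top-k) selection: sample indices according to probs.
    idx = rng.choice(importance.size, size=keep, replace=False, p=probs)
    return np.sort(idx), probs


if __name__ == "__main__":
    idx, probs = select_patches(
        importance=[2.0, 0.1, 1.5, 0.3],
        past_corr=[0.0, 0.9, 0.2, 0.1],
        keep=2,
        rng=np.random.default_rng(0),
    )
    print("selected patches:", idx)
    print("selection probabilities:", np.round(probs, 3))
```

Sampling rather than taking the top-k keeps some chance of selecting lower-scored patches. Only the `keep` selected patches would then be passed to the encoder, and processing fewer patches is a plausible source of memory savings.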