STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment

Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. This is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting, in which newly learned correlations cause the model to forget earlier audio-video relations. To tackle these challenges, we propose a new continual audio-video pre-training method built on two novel ideas. (1) Localized Patch Importance Scoring: we introduce a multimodal encoder that determines an importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge caused by drift, we assess the correlation of the current patches with the past steps to identify the patches that are highly correlated with those steps. Based on the results of these two components, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a relative performance gain of 3.69%p in zero-shot retrieval tasks over strong continual learning baselines, while reducing memory consumption by ~45%.
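
The abstract does not say how the two scores are combined, or in which direction a high correlation with past steps should shift the selection. The sketch below is therefore only an illustration under stated assumptions, not the authors' method: `combine_scores`, `select_patches`, the penalty factor `lam`, and the rule "weight = importance × (1 − lam × past correlation)" are all hypothetical. It shows one way to turn per-patch weights into a probabilistic selection of `k` distinct patches, using weighted sampling without replacement (the Efraimidis-Spirakis key method).

```python
import random


def combine_scores(importance, past_corr, lam=0.5):
    """Hypothetical combination rule (an assumption, not from the paper):
    down-weight patches that are strongly correlated with past steps."""
    if len(importance) != len(past_corr):
        raise ValueError("score lists must have the same length")
    eps = 1e-8  # keep every weight strictly positive so sampling is defined
    return [max(i * (1.0 - lam * c), eps) for i, c in zip(importance, past_corr)]


def select_patches(weights, k, seed=None):
    """Sample k distinct patch indices; larger weights are more likely to be kept.

    Each patch draws a key u ** (1 / w) with u uniform in [0, 1); the k largest
    keys win (Efraimidis-Spirakis weighted sampling without replacement).
    """
    if not 0 <= k <= len(weights):
        raise ValueError("k must be between 0 and the number of patches")
    rng = random.Random(seed)
    keys = [(rng.random() ** (1.0 / w), idx) for idx, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(idx for _, idx in keys[:k])


if __name__ == "__main__":
    # Toy scores for six patches; the values are made up for illustration.
    importance = [0.9, 0.1, 0.6, 0.3, 0.8, 0.2]
    past_corr = [0.1, 0.9, 0.2, 0.7, 0.0, 0.5]
    weights = combine_scores(importance, past_corr)
    print(select_patches(weights, k=3, seed=0))
```

Because the selection is sampled rather than taken as a hard top-k, low-weight patches still have a small chance of being kept, which is the usual motivation for probabilistic over deterministic selection.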

