STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment

Continuously learning a variety of audio-video semantics over time is crucial for audio-related reasoning tasks in our ever-evolving world. However, this is a nontrivial problem that poses two critical challenges: sparse spatio-temporal correlation between audio-video pairs, and multimodal correlation overwriting, in which previously learned audio-video relations are forgotten. To tackle this problem, we propose a new continual audio-video pre-training method built on two novel ideas. (1) Localized Patch Importance Scoring: we introduce a multimodal encoder that determines an importance score for each patch, emphasizing semantically intertwined audio-video patches. (2) Replay-guided Correlation Assessment: to reduce the corruption of previously learned audiovisual knowledge due to drift, we assess the correlation of the current patches with the past steps and identify the patches that correlate highly with them. Based on the results of these two components, we perform probabilistic patch selection for effective continual audio-video pre-training. Experimental validation on multiple benchmarks shows that our method achieves a relative performance gain of 3.69%p in zero-shot retrieval tasks compared to strong continual learning baselines, while reducing memory consumption by ~45%.
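
To make the idea of importance-driven probabilistic patch selection concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the scoring rule (softmax-normalised dot-product similarity between video and audio patch embeddings), the function names `patch_importance` and `select_patches`, and the `keep_ratio` parameter are all assumptions made for illustration. The replay-guided correlation term is left out because the abstract does not say how it enters the selection probabilities.

```python
import numpy as np


def patch_importance(video_patches, audio_patches):
    """Hypothetical importance score for each video patch.

    Assumption: importance is the mean attention a video patch receives
    from the audio patches under scaled dot-product similarity.
    """
    dim = video_patches.shape[-1]
    sim = video_patches @ audio_patches.T / np.sqrt(dim)  # (n_video, n_audio)
    # Softmax over video patches, computed per audio patch (column-wise).
    sim = sim - sim.max(axis=0, keepdims=True)  # numerical stability
    attn = np.exp(sim)
    attn /= attn.sum(axis=0, keepdims=True)
    # Averaging the columns gives one score per video patch; scores sum to 1.
    return attn.mean(axis=1)


def select_patches(scores, keep_ratio, rng):
    """Sample patch indices without replacement, proportional to their scores."""
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    probs = scores / scores.sum()
    chosen = rng.choice(scores.size, size=n_keep, replace=False, p=probs)
    return np.sort(chosen)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(16, 8))  # 16 video patches, 8-dim embeddings
    audio = rng.normal(size=(4, 8))   # 4 audio patches, 8-dim embeddings
    scores = patch_importance(video, audio)
    kept = select_patches(scores, keep_ratio=0.5, rng=rng)
    print("kept patch indices:", kept)
```

Sampling rather than taking a hard top-k keeps some low-scoring patches in play, which is one plausible reading of "probabilistic patch selection"; the paper itself should be consulted for the actual scoring and selection rules.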
