Lifelong Audio-video Masked Autoencoder with Forget-robust Localized Alignments

We present a lifelong audio-video masked autoencoder that continually learns multimodal representations from a video stream of audio-video pairs whose distribution shifts over time. We propose two ideas to tackle this problem: (1) Localized Alignment: a small trainable multimodal encoder predicts the audio and video tokens that are well aligned with each other, so that the model learns only from highly correlated audiovisual patches with accurate multimodal relationships. (2) Forget-robust multimodal patch selection: we compare the relative importance of each audio-video patch between the current and past data pairs to mitigate unintended drift of previously learned audio-video representations. The resulting method, FLAVA (Forget-robust Localized Audio-Video Alignment), captures the complex relationships between the audio and video modalities while training on a sequence of pre-training tasks, and alleviates forgetting of learned audiovisual correlations. Our experiments show that FLAVA outperforms state-of-the-art continual learning methods on several benchmark datasets under continual audio-video representation learning scenarios.
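
The abstract gives no formulas, so the following is only a rough illustration of the two ideas, not the authors' implementation. It assumes that patch importance can be approximated by the average cross-modal attention a video patch receives from audio tokens, and that forget-robust selection keeps the patches whose importance for the current pair most exceeds their importance for a past pair. The function names, the scoring rule, and the toy vectors are all hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def patch_importance(query_tokens, key_tokens):
    # Assumed scoring rule: average scaled dot-product attention that each
    # key token (e.g. a video patch) receives from the query tokens (audio).
    scale = math.sqrt(len(key_tokens[0]))
    scores = [0.0] * len(key_tokens)
    for q in query_tokens:
        attn = softmax([dot(q, k) / scale for k in key_tokens])
        for i, a in enumerate(attn):
            scores[i] += a / len(query_tokens)
    return scores

def select_patches(current_imp, past_imp, keep):
    # Assumed selection rule: keep the patches whose importance for the
    # current pair most exceeds their importance for the past pair.
    margin = [c - p for c, p in zip(current_imp, past_imp)]
    ranked = sorted(range(len(margin)), key=lambda i: margin[i], reverse=True)
    return sorted(ranked[:keep])

# Toy data: three video patches, one audio token from the current pair
# and one audio token from a past pair (all values are made up).
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
current_audio = [[2.0, 0.0]]
past_audio = [[0.0, 2.0]]

current_imp = patch_importance(current_audio, video)
past_imp = patch_importance(past_audio, video)
selected = select_patches(current_imp, past_imp, keep=1)
```

In this toy case, patch 0 correlates with the current audio but not with the past audio, so it is the one selected. Patch 2 correlates equally with both, so under the assumed rule it is treated as more likely to disturb previously learned correlations.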

