Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins by extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. This predictor enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks.
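To make the aggregation step concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of how features from multiple augmented views could be pooled into a single representative embedding via scaled dot-product attention against a learned query vector; the function names, dimensions, and the specific pooling form are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate(views, query):
    """Attention-pool embeddings from augmented views (illustrative sketch).

    views: (n_aug, d) array of embeddings, one per augmented input.
    query: (d,) learned query vector (assumed here for illustration).
    Returns a (d,) representative embedding as a convex combination
    of the view embeddings, weighted by attention scores.
    """
    scores = views @ query / np.sqrt(views.shape[1])  # scaled dot-product
    weights = softmax(scores)                         # (n_aug,), sums to 1
    return weights @ views                            # weighted average

rng = np.random.default_rng(0)
views = rng.normal(size=(4, 8))  # e.g. 4 augmentations, 8-dim features
query = rng.normal(size=8)
rep = aggregate(views, query)    # representative embedding, shape (8,)
```

Because the output is a convex combination of the per-augmentation embeddings, the aggregated target stays within the span of the observed views, which is one plausible way such pooling can yield a stable supervision signal across diverse augmentations.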