Recent advances in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins by extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. The predictor enables features from diverse augmentations to be aggregated into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks. The code is available at https://github.com/JongSuk1/EquiAV.
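To make the mechanism concrete, below is a minimal, hypothetical PyTorch sketch of the idea described above; it is not the official EquiAV implementation (see the repository linked above for that). All names and dimensions here (`TransformPredictor`, `representative`, `info_nce`, the parameter vector `t`) are illustrative assumptions: a shared attention-based predictor maps a clean embedding together with augmentation parameters to a predicted augmented embedding, and the predictions for several sampled augmentations are averaged into a representative embedding used for cross-modal contrast, so the raw audio-visual correspondence is never disturbed by actually augmenting the paired inputs.

```python
# Hypothetical sketch of an equivariant audio-visual contrastive objective.
# Module and argument names are illustrative, not the official EquiAV API.
import torch
import torch.nn.functional as F


class TransformPredictor(torch.nn.Module):
    """Shared attention-based predictor: given a clean embedding and an
    encoding of augmentation parameters, predicts the embedding the encoder
    would produce for the augmented input (the equivariance map)."""

    def __init__(self, dim: int, param_dim: int, num_heads: int = 4):
        super().__init__()
        self.param_proj = torch.nn.Linear(param_dim, dim)
        self.attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, z: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # z: (B, D) clean embedding; t: (B, P) augmentation parameters.
        q = z.unsqueeze(1)                                 # query: clean embedding
        kv = torch.stack([z, self.param_proj(t)], dim=1)   # keys/values: embedding + params
        out, _ = self.attn(q, kv, kv)
        return out.squeeze(1)                              # predicted augmented embedding


def representative(z: torch.Tensor, ts: torch.Tensor,
                   pred: TransformPredictor) -> torch.Tensor:
    """Aggregate predicted embeddings for K sampled augmentations into one
    representative embedding. ts: (B, K, P) augmentation parameter vectors."""
    zs = torch.stack([pred(z, ts[:, k]) for k in range(ts.size(1))], dim=1)
    return zs.mean(dim=1)  # cheap aggregation: no extra encoder forward passes


def info_nce(za: torch.Tensor, zv: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between audio and visual embeddings."""
    za, zv = F.normalize(za, dim=-1), F.normalize(zv, dim=-1)
    logits = za @ zv.t() / tau
    labels = torch.arange(za.size(0), device=za.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Hypothetical usage: z_a, z_v are clean audio/visual embeddings from the
# backbone encoders; t_a, t_v are sampled augmentation-parameter vectors.
B, D, P, K = 32, 256, 16, 8
pred = TransformPredictor(D, P)
z_a, z_v = torch.randn(B, D), torch.randn(B, D)
t_a, t_v = torch.randn(B, K, P), torch.randn(B, K, P)
loss = info_nce(representative(z_a, t_a, pred), representative(z_v, t_v, pred))
```

Note that the predictor operates purely in embedding space, which is why the aggregation over K augmentations adds only minimal computational overhead compared to encoding K augmented inputs. A complete training objective would presumably also include an intra-modal equivariance term pulling the predictor's output toward the encoder's embedding of the actually augmented input; that part is omitted from this sketch.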