Contrastive learning has emerged as a powerful technique in audio-visual representation learning, leveraging the natural co-occurrence of the audio and visual modalities in web-scale video datasets to achieve significant advances. However, conventional contrastive audio-visual learning methods typically rely on representations aggregated over time, which neglects the intrinsically sequential nature of the data. This oversight raises concerns about the ability of standard approaches to capture and exploit fine-grained information within sequences, information that is vital for distinguishing between semantically similar yet distinct examples. To address this limitation, we propose sequential contrastive audio-visual learning (SCAV), which contrasts examples in their non-aggregated representation space using sequential distances. Retrieval experiments on the VGGSound and Music datasets demonstrate the effectiveness of SCAV, showing 2-3x relative improvements over traditional aggregation-based contrastive learning and other methods from the literature. We also show that models trained with SCAV exhibit a high degree of flexibility in the metric used for retrieval, allowing them to operate along a spectrum of efficiency-accuracy trade-offs and potentially making them applicable in multiple scenarios, from small- to large-scale retrieval.
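To make the core idea concrete, the following is a minimal sketch of a contrastive loss computed over sequential (non-aggregated) representations. It uses a simple mean framewise Euclidean distance as a stand-in for the sequential distance; the specific distance and the function and parameter names here are illustrative assumptions, not the actual SCAV implementation.

```python
import numpy as np

def framewise_distance(a, b):
    # Mean L2 distance between time-aligned frames of two equal-length
    # sequences of shape (time, dim). This is a simple stand-in for a
    # sequential distance; SCAV's actual metric may differ.
    return float(np.mean(np.linalg.norm(a - b, axis=-1)))

def sequential_contrastive_loss(audio_seqs, video_seqs, temperature=0.1):
    """InfoNCE-style contrastive loss over sequence-level distances.

    audio_seqs, video_seqs: arrays of shape (batch, time, dim),
    paired along the batch axis (i-th audio matches i-th video).
    """
    n = audio_seqs.shape[0]
    # Negative distances act as similarity logits: closer sequences
    # receive higher scores.
    logits = np.array(
        [[-framewise_distance(audio_seqs[i], video_seqs[j])
          for j in range(n)] for i in range(n)]
    ) / temperature
    # Softmax cross-entropy with the diagonal (true pairs) as targets.
    logits -= logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Because the loss operates directly on frame-level sequences, it can penalize pairs that only differ in fine-grained temporal detail, which a time-pooled embedding would collapse.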