Contrastive learning has emerged as a powerful technique in audio-visual representation learning, leveraging the natural co-occurrence of the audio and visual modalities in web-scale video datasets. However, conventional contrastive audio-visual learning (CAV) methods rely on representations aggregated over time, neglecting the intrinsically sequential nature of the data. This raises concerns about the ability of standard approaches to capture and exploit fine-grained information within sequences. In response to this limitation, we propose sequential contrastive audio-visual learning (SCAV), which contrasts examples directly in their non-aggregated representation space using multidimensional sequential distances. Audio-visual retrieval experiments on the VGGSound and Music datasets demonstrate the effectiveness of SCAV, with up to 3.5x relative improvements in recall over traditional aggregation-based contrastive learning and over previously proposed methods that use more parameters and data. We also show that models trained with SCAV exhibit considerable flexibility with respect to the metric used for retrieval, enabling a hybrid retrieval approach that is both effective and efficient.
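The core idea, contrasting paired audio and video clips as full sequences rather than as time-pooled vectors, can be sketched in a few lines. The sketch below is illustrative only: it assumes a simple frame-wise Euclidean distance as the "sequential distance" and an InfoNCE-style objective, whereas the paper's actual multidimensional sequential distance and loss may differ. The function names `sequence_distance` and `scav_style_loss` are hypothetical.

```python
import numpy as np

def sequence_distance(a, b):
    # Mean frame-wise Euclidean distance between two temporally
    # aligned sequences of shape (T, D). This is an assumed,
    # illustrative stand-in for the paper's sequential distance;
    # crucially, no temporal pooling is applied to a or b.
    return float(np.mean(np.linalg.norm(a - b, axis=-1)))

def scav_style_loss(audio_seqs, video_seqs, temperature=0.1):
    # audio_seqs, video_seqs: lists of (T, D) arrays, paired by index.
    # InfoNCE-style contrastive loss where similarity is the negative
    # sequence distance, so matched clips (small distance) are pulled
    # together and mismatched clips are pushed apart.
    sims = np.array([[-sequence_distance(a, v) for v in video_seqs]
                     for a in audio_seqs]) / temperature
    # Row-wise log-softmax; positives sit on the diagonal.
    logits = sims - sims.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Because the loss is defined over whole sequences, the same (or another) sequential distance can be reused at retrieval time, which is where the flexibility mentioned above comes from.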