Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to adapt effectively to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential way to reduce the annotation requirements of MAEs. Although combining the two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Using contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both input and virtual label spaces. Experiments in low/few-shot settings demonstrate that uaMix-MAE achieves 4-6% accuracy improvements over various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at https://github.com/PLAN-Lab/uamix-MAE
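
As a rough illustration of what mixing "in both input and virtual label spaces" can look like, the sketch below treats every unlabeled clip in a batch as its own virtual class, interpolates pairs of inputs, and interpolates their one-hot virtual labels with the same coefficient. The abstract does not specify the mixing rule or the loss, so the mixup-style linear interpolation, the Beta-distributed coefficient, the soft-target contrastive loss, and all names (`mix_inputs_and_virtual_labels`, `soft_contrastive_loss`, `alpha`, `temperature`) are assumptions made for illustration, not the released uaMix-MAE implementation.

```python
import numpy as np


def mix_inputs_and_virtual_labels(x, alpha=1.0, rng=None):
    """Mix each sample with a random partner in input and virtual label space.

    x: array of shape (batch, ...), e.g. a batch of log-mel spectrograms.
    Returns the mixed inputs and a (batch, batch) matrix of soft virtual labels.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = x.shape[0]
    lam = rng.beta(alpha, alpha)      # assumed: one mixing coefficient per batch
    perm = rng.permutation(n)         # partner index for every sample
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    virtual = np.eye(n)               # instance i is its own virtual class
    y_mixed = lam * virtual + (1.0 - lam) * virtual[perm]
    return x_mixed, y_mixed


def soft_contrastive_loss(z_mixed, z_anchor, y_mixed, temperature=0.2):
    """Cross-entropy between similarity logits and the mixed virtual labels."""
    z_mixed = z_mixed / np.linalg.norm(z_mixed, axis=1, keepdims=True)
    z_anchor = z_anchor / np.linalg.norm(z_anchor, axis=1, keepdims=True)
    logits = z_mixed @ z_anchor.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(y_mixed * log_prob).sum(axis=1).mean())
```

In a tuning loop, `z_mixed` would be the encoder output for `x_mixed` and `z_anchor` the encoder output for a second view of the unmixed clips; the encoder is left out here to keep the sketch self-contained.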