Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. Specifically, we do not encode masked tokens, we use a fast convolutional decoder, and we amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x less pre-training time; on Librispeech speech recognition, it performs as well as wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding, it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy yields an ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
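The efficiency ideas named in the abstract (skipping masked tokens in the student encoder and amortizing the teacher computation) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `toy_encoder`, `multimask_step`, and all parameters are hypothetical names introduced here, the encoder is a single linear layer standing in for a Transformer, and the convolutional decoder is replaced by a trivial mean-of-visible-tokens predictor purely to keep the example short.

```python
import numpy as np


def toy_encoder(x, w):
    """Hypothetical stand-in for a Transformer encoder: linear layer + tanh per token."""
    return np.tanh(x @ w)


def multimask_step(x, w_student, w_teacher, num_masks=4, mask_ratio=0.75, seed=0):
    """Reuse one teacher pass for several masked versions of the same sample.

    x: array of shape (n_tokens, dim).
    Returns (losses, teacher_passes, encoded_tokens).
    """
    rng = np.random.default_rng(seed)
    n_tokens, _ = x.shape
    n_masked = int(round(mask_ratio * n_tokens))
    if not 0 < n_masked < n_tokens:
        raise ValueError("mask_ratio must leave at least one masked and one visible token")

    # Teacher sees the full, unmasked input exactly once (amortized over all masks).
    targets = toy_encoder(x, w_teacher)
    teacher_passes = 1

    losses = []
    encoded_tokens = 0
    for _ in range(num_masks):
        perm = rng.permutation(n_tokens)
        masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

        # Student encodes only the visible tokens; masked tokens are never encoded.
        visible = toy_encoder(x[visible_idx], w_student)
        encoded_tokens += len(visible_idx)

        # Assumption: a trivial "decoder" that predicts every masked position as the
        # mean of the visible representations (the paper uses a convolutional decoder).
        pred = np.broadcast_to(visible.mean(axis=0), (n_masked, visible.shape[1]))

        # Regress the teacher's contextualized targets at the masked positions.
        losses.append(float(np.mean((pred - targets[masked_idx]) ** 2)))

    return losses, teacher_passes, encoded_tokens
```

With 16 tokens, a mask ratio of 0.75, and 4 masked versions, the student in this sketch encodes 4 tokens per version (16 in total) while the teacher runs once, which is the source of the savings the abstract describes.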