Vision Transformers (ViTs) have become ubiquitous in computer vision. Despite their success, ViTs lack inductive biases, which makes them difficult to train with limited data. To address this challenge, prior studies suggest pre-training ViTs with self-supervised learning (SSL) and then fine-tuning them on the primary task. However, we observe that jointly optimizing ViTs for the primary task and a Self-Supervised Auxiliary Task (SSAT) is surprisingly beneficial when the amount of training data is limited. We explore which SSL tasks can be optimized alongside the primary task, which training schemes suit these tasks, and the data scale at which they are most effective. Our findings reveal that SSAT is a powerful technique that enables ViTs to leverage the unique characteristics of both the self-supervised and primary tasks, achieving better performance than the typical pipeline of SSL pre-training followed by fine-tuning. Our experiments, conducted on 10 datasets, demonstrate that SSAT significantly improves ViT performance while reducing the carbon footprint of training. We also confirm the effectiveness of SSAT in the video domain for deepfake detection, showcasing its generalizability. Our code is available at https://github.com/dominickrei/Limited-data-vits.
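To make the idea of joint optimization concrete, the following is a minimal sketch of a combined objective for a single training example. It is illustrative only: the abstract does not name the auxiliary task or the loss weighting, so the masked-reconstruction auxiliary loss, the convex weighting, and every name here (`cross_entropy`, `masked_mse`, `joint_loss`, `weight`) are assumptions rather than the authors' implementation.

```python
import math


def cross_entropy(logits, label):
    """Primary-task loss: cross-entropy for one example (stable log-sum-exp)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def masked_mse(pred, target, mask):
    """Assumed auxiliary loss: mean squared error over masked positions only."""
    errs = [(p - t) ** 2 for p, t, keep in zip(pred, target, mask) if keep]
    return sum(errs) / len(errs) if errs else 0.0


def joint_loss(logits, label, recon, target, mask, weight=0.5):
    """Weighted sum of both losses, so one backward pass would update a
    shared encoder for both tasks. `weight` is a hypothetical hyperparameter."""
    primary = cross_entropy(logits, label)
    auxiliary = masked_mse(recon, target, mask)
    return weight * primary + (1.0 - weight) * auxiliary


if __name__ == "__main__":
    loss = joint_loss(
        logits=[0.0, 0.0], label=0,          # toy classifier output
        recon=[1.0, 2.0], target=[1.0, 0.0],  # toy reconstruction
        mask=[False, True],                   # only the second value is masked
    )
    print(f"joint loss: {loss:.4f}")
```

In contrast to sequential pre-training and fine-tuning, both terms here contribute to the same update at every step; how the two terms should be balanced is a design choice this sketch leaves open.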