Vision Transformers (ViTs) have demonstrated exceptional performance across a variety of vision tasks. However, they tend to underperform on smaller datasets due to their inherent lack of inductive biases. Current approaches address this limitation implicitly, often by pairing ViTs with pretext tasks or by distilling knowledge from convolutional neural networks (CNNs) to strengthen the prior. In contrast, Self-Organizing Maps (SOMs), a widely adopted self-supervised framework, are inherently structured to preserve topology and spatial organization, making them a promising candidate for directly addressing the limitations of ViTs on limited or small training datasets. Despite this potential, equipping SOMs with modern deep learning architectures remains largely unexplored. In this study, we conduct a novel exploration of how Vision Transformers (ViTs) and Self-Organizing Maps (SOMs) can empower each other, aiming to bridge this critical research gap. Our findings demonstrate that these architectures can synergistically enhance each other, leading to significantly improved performance in both unsupervised and supervised tasks. Code is publicly available on GitHub.