We propose Masked Capsule Autoencoders (MCAE), the first Capsule Network to be pretrained in a self-supervised manner. Capsule Networks have emerged as a powerful alternative to Convolutional Neural Networks (CNNs) and have shown favourable properties when compared to Vision Transformers (ViTs), but they have struggled to learn effectively from more complex data, so existing Capsule Network models do not scale to modern tasks. MCAE alleviates this issue by reformulating the Capsule Network to use masked image modelling as a pretraining stage before supervised finetuning. Across several experiments and ablation studies we demonstrate that, like CNNs and ViTs, Capsule Networks also benefit from self-supervised pretraining, paving the way for further advances in this neural network domain. For instance, when pretraining on the Imagenette dataset, which consists of 10 classes of ImageNet-sized images, we achieve not only state-of-the-art results for Capsule Networks but also a 9% improvement over purely supervised training. We therefore propose that Capsule Networks should be trained within a masked image modelling framework, with a novel capsule decoder, to improve their performance on realistic-sized images.
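As a rough illustration of the masked image modelling stage described above, the sketch below randomly partitions patch indices into visible and masked sets and scores a reconstruction on the masked patches only. It is not the MCAE implementation: the function names, the 75% mask ratio, the 14×14 patch grid, and the choice to compute the loss only on masked patches are all assumptions borrowed from common masked-autoencoder practice.

```python
import random


def split_patch_indices(num_patches, mask_ratio, seed=0):
    """Randomly partition patch indices into (visible, masked) lists.

    Hypothetical helper: the ratio and sampling scheme are assumptions,
    not values taken from the abstract.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


def masked_reconstruction_loss(pred, target, masked):
    """Mean squared error over the masked patches only (an assumption)."""
    total, count = 0.0, 0
    for i in masked:
        for p, t in zip(pred[i], target[i]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Assumed setup: a 14x14 grid of patches with 75% of them hidden.
visible, masked = split_patch_indices(num_patches=196, mask_ratio=0.75)
print(len(visible), len(masked))
```

In a pretraining loop of this kind, only the visible patches would be passed to the encoder, the decoder would predict pixel values for the masked ones, and the encoder would afterwards be finetuned with labels.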