Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categorized according to the parameters of the mathematical formula that generates them. In the present work, we hypothesize that the process of generating different instances of the same category in FDSL can be viewed as a form of data augmentation. We validate this hypothesis by replacing the instances with data augmentation, so that only a single image per category is needed. Our experiments show that this one-instance fractal database (OFDB) performs better than the original dataset, in which instances were explicitly generated. We further scale up OFDB to 21,000 categories and show that it matches, or even surpasses, the model pre-trained on ImageNet-21k in ImageNet-1k fine-tuning. OFDB contains only 21k images, whereas ImageNet-21k contains 14M. This opens new possibilities for pre-training vision transformers with much smaller datasets.
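The one-image-per-category idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the pipeline used in this work: categories are taken to be random iterated function systems rendered with the chaos game, and the augmentation is a simple random flip plus a random shift. The function names `sample_ifs`, `render`, `augment`, and `build_dataset` are hypothetical.

```python
import numpy as np

def sample_ifs(rng, n_maps=4):
    """Draw one hypothetical category: a set of contractive affine maps."""
    mats = rng.uniform(-1.0, 1.0, size=(n_maps, 2, 2))
    for i in range(n_maps):
        # Rescale each matrix to spectral norm 0.8 so every map is contractive.
        norm = np.linalg.norm(mats[i], 2)
        mats[i] *= 0.8 / max(norm, 1e-8)
    shifts = rng.uniform(-1.0, 1.0, size=(n_maps, 2))
    return mats, shifts

def render(mats, shifts, rng, size=64, n_points=20000):
    """Render one binary fractal image with the chaos game."""
    pts = np.empty((n_points, 2))
    p = np.zeros(2)
    choices = rng.integers(0, len(mats), size=n_points)
    for i, k in enumerate(choices):
        p = mats[k] @ p + shifts[k]
        pts[i] = p
    # Map the point cloud onto the pixel grid.
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.maximum(hi - lo, 1e-8)
    idx = ((pts - lo) / span * (size - 1)).astype(int)
    img = np.zeros((size, size), dtype=np.uint8)
    img[idx[:, 1], idx[:, 0]] = 255
    return img

def augment(img, rng):
    """Stand-in augmentation (assumed): random horizontal flip and cyclic shift."""
    if rng.random() < 0.5:
        img = img[:, ::-1]
    dy, dx = rng.integers(-8, 9, size=2)
    return np.roll(img, (int(dy), int(dx)), axis=(0, 1))

def build_dataset(n_categories, seed=0):
    """Store exactly one rendered image per category."""
    rng = np.random.default_rng(seed)
    return [render(*sample_ifs(rng), rng) for _ in range(n_categories)]

# One stored image per category; new "instances" are produced on the fly.
dataset = build_dataset(n_categories=3)
aug_rng = np.random.default_rng(123)
batch = [(augment(img, aug_rng), label) for label, img in enumerate(dataset)]
```

In this sketch the stored dataset has as many images as categories, and the variation that explicitly generated instances would otherwise provide comes entirely from `augment` at training time.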