Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categorized according to the parameters of the mathematical formula that generates them. In the present work, we hypothesize that the process of generating different instances of the same category in FDSL can be viewed as a form of data augmentation. We validate this hypothesis by replacing the instances with data augmentation, which means we need only a single image per category. Our experiments show that pre-training on this one-instance fractal database (OFDB) performs better than pre-training on the original dataset, in which instances were explicitly generated. We further scale up OFDB to 21,000 categories and show that a model pre-trained on it matches, or even surpasses, a model pre-trained on ImageNet-21k in ImageNet-1k fine-tuning. OFDB contains only 21k images, whereas ImageNet-21k contains 14M. This opens new possibilities for pre-training vision transformers with much smaller datasets.
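The one-image-per-category idea can be illustrated with a minimal sketch. It assumes an iterated function system (IFS) as the fractal formula, rendered with the chaos game, and uses a random crop plus horizontal flip as a stand-in for the augmentation; the function names (`sample_ifs`, `render`, `augment`), the image sizes, and the parameter ranges are all hypothetical choices for illustration, not the pipeline used to build OFDB.

```python
import numpy as np

def sample_ifs(rng, n_maps=3):
    """Draw one hypothetical category: a set of contractive affine maps."""
    A = rng.uniform(-1.0, 1.0, size=(n_maps, 2, 2))
    for k in range(n_maps):
        # Rescale so the spectral norm is 0.8, which keeps the iteration bounded.
        A[k] *= 0.8 / max(np.linalg.norm(A[k], 2), 1e-8)
    b = rng.uniform(-1.0, 1.0, size=(n_maps, 2))
    return A, b

def render(A, b, rng, n_points=20000, size=64):
    """Render the IFS attractor into a binary image via the chaos game."""
    x = np.zeros(2)
    pts = np.empty((n_points, 2))
    for i, k in enumerate(rng.integers(0, len(A), size=n_points)):
        x = A[k] @ x + b[k]
        pts[i] = x
    pts = pts[100:]  # drop the first points as burn-in
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    ij = ((pts - lo) / np.maximum(hi - lo, 1e-8) * (size - 1)).astype(int)
    img = np.zeros((size, size), dtype=np.uint8)
    img[ij[:, 1], ij[:, 0]] = 255
    return img

def augment(img, rng, crop=56):
    """Random crop and horizontal flip, standing in for instance variation."""
    top = rng.integers(0, img.shape[0] - crop + 1)
    left = rng.integers(0, img.shape[1] - crop + 1)
    out = img[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        out = out[:, ::-1]
    return out.copy()

rng = np.random.default_rng(0)
n_categories = 5  # the scaled-up setting in the text uses 21,000

# Exactly one stored image per category; the list index is the label.
dataset = [render(*sample_ifs(rng), rng) for _ in range(n_categories)]

# At training time, each pass draws a fresh augmented view of that single image.
batch = [(augment(img, rng), label) for label, img in enumerate(dataset)]
```

The design point is that `dataset` holds only `n_categories` images, and all within-category variation comes from `augment` at sampling time rather than from stored instances.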