Vision Transformers (ViTs) and convolutional neural networks (CNNs) are widely used deep neural networks (DNNs) for classification tasks. Their architectures depend on the number of classes in the dataset they are trained on, so any change in the number of classes forces a partial or full change to the model's architecture. This work addresses the question: is it possible to create a model architecture that is agnostic to the number of classes? Such an architecture would be independent of the dataset it is trained on. This work highlights the issues with current architectures (ViTs and CNNs) and proposes OneCAD (One Classifier for All image Datasets), a training and inference framework for achieving a transformer model that is close to number-of-class-agnostic. To the best of our knowledge, this is the first work to use Masked Image Modeling (MIM) with multimodal learning for classification in order to create a DNN architecture that is agnostic to the number of classes. Preliminary results are shown on natural and medical image datasets: MNIST, CIFAR10, CIFAR100, and COVIDx. Code will soon be publicly available on GitHub.