This study evaluates the trade-offs between convolutional and transformer-based architectures on both medical and general-purpose image classification benchmarks. Using ResNet-18 as the baseline, we apply a fine-tuning strategy to four Vision Transformer variants (Tiny, Small, Base, Large) on DermatologyMNIST and TinyImageNet. Our goal is to reduce inference latency and model complexity while keeping accuracy degradation within acceptable bounds. Through a systematic sweep of hyperparameters, we demonstrate that appropriately fine-tuned Vision Transformers can match or exceed the baseline's accuracy while running inference faster and with fewer parameters, highlighting their viability for deployment in resource-constrained environments.
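To make the setup concrete, the following is a minimal sketch, assuming the `timm` and `torch` libraries, of how the four pretrained ViT variants could be instantiated with a fresh classification head and profiled for parameter count and per-image latency. The model names, the 7-class head, and the benchmarking loop are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch (illustrative, not the paper's exact recipe): load
# ImageNet-pretrained ViT variants via timm, swap in a new classification
# head, and report parameter counts and rough per-image CPU latency.
import time

import timm
import torch

# The four variants named in the abstract; exact checkpoint names are assumed.
VARIANTS = [
    "vit_tiny_patch16_224",
    "vit_small_patch16_224",
    "vit_base_patch16_224",
    "vit_large_patch16_224",
]


def build_model(name: str, num_classes: int = 7) -> torch.nn.Module:
    # pretrained=True pulls ImageNet weights; num_classes replaces the head
    # (7 classes is an assumption for a dermatology benchmark).
    return timm.create_model(name, pretrained=True, num_classes=num_classes)


def param_count(model: torch.nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())


@torch.no_grad()
def latency_ms(model: torch.nn.Module, runs: int = 50) -> float:
    # Crude single-image CPU latency estimate; a real benchmark would warm
    # up, fix the device, and report percentiles rather than a mean.
    model.eval()
    x = torch.randn(1, 3, 224, 224)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs * 1e3


if __name__ == "__main__":
    for name in VARIANTS:
        model = build_model(name)
        print(
            f"{name}: {param_count(model) / 1e6:.1f}M params, "
            f"{latency_ms(model):.1f} ms/image"
        )
```

Fine-tuning would then proceed with a standard supervised loop over the chosen dataset; only the head is fresh, so a lower learning rate for the pretrained backbone is a common choice.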