Vision Transformers (ViTs) have emerged as the state-of-the-art architecture in representation learning, leveraging self-attention mechanisms to excel at a wide range of tasks. ViTs split images into fixed-size patches, constraining inputs to a predefined size and necessitating pre-processing steps such as resizing, padding, or cropping. This poses challenges in medical imaging, particularly for irregularly shaped structures such as tumors: a fixed bounding-box crop size produces input images with highly variable foreground-to-background ratios, while resizing medical images can degrade information and introduce artefacts, impacting diagnosis. Hence, tailoring variable-sized crops to regions of interest can enhance feature representation capabilities. Moreover, large images are computationally expensive, and smaller sizes risk information loss, presenting a computation-accuracy tradeoff. We propose VariViT, an improved ViT model designed to handle variable image sizes while maintaining a consistent patch size. VariViT employs a novel positional-embedding resizing scheme that accommodates a variable number of patches. We also implement a new batching strategy within VariViT to reduce computational complexity, resulting in faster training and inference. In our evaluations on two 3D brain MRI datasets, VariViT surpasses vanilla ViTs and ResNet in glioma genotype prediction and brain tumor classification, achieving F1-scores of 75.5% and 76.3%, respectively, and learning more discriminative features. Our proposed batching strategy reduces computation time by up to 30% compared to conventional architectures. These findings underscore the efficacy of VariViT in image representation learning. Our code is available at https://github.com/Aswathi-Varma/varivit
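The abstract names two mechanisms without detailing them: resizing learned positional embeddings to match a variable patch count, and batching images of the same size together to avoid padding. The sketch below is a minimal illustration of both ideas, not the authors' implementation: it assumes the positional embeddings form a regular patch grid that can be linearly interpolated along each spatial axis, and that batches are formed by bucketing samples with identical spatial shapes. The names `resize_pos_embed` and `bucket_by_size` are illustrative, not from the paper.

```python
import numpy as np

def resize_axis(x, axis, new_len):
    """Linearly resample array x along one axis to new_len samples."""
    old_len = x.shape[axis]
    if old_len == new_len:
        return x
    # Target sample positions expressed in the old grid's coordinates.
    pos = np.linspace(0, old_len - 1, new_len)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, old_len - 1)
    # Broadcastable interpolation weights along the chosen axis.
    w = (pos - lo).reshape([-1 if i == axis else 1 for i in range(x.ndim)])
    return np.take(x, lo, axis=axis) * (1 - w) + np.take(x, hi, axis=axis) * w

def resize_pos_embed(pos_embed, old_grid, new_grid):
    """Resize a learned positional-embedding table from one patch grid
    to another (e.g. a 3D grid for volumetric MRI), axis by axis.

    pos_embed: (prod(old_grid), dim) array of per-patch embeddings.
    Returns:   (prod(new_grid), dim) array.
    """
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(*old_grid, dim)
    for ax, n in enumerate(new_grid):
        grid = resize_axis(grid, ax, n)
    return grid.reshape(-1, dim)

def bucket_by_size(shapes):
    """Group sample indices so each batch contains only images of one
    spatial size, avoiding padding to a common maximum."""
    buckets = {}
    for i, shape in enumerate(shapes):
        buckets.setdefault(tuple(shape), []).append(i)
    return list(buckets.values())
```

Because each bucket holds tensors of identical shape, batches can be stacked without padding, which is one plausible way the reported compute savings could arise; the paper's actual strategy may differ in detail.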