ZACH-ViT: Regime-Dependent Inductive Bias in Compact Vision Transformers for Medical Imaging

Vision Transformers rely on positional embeddings and class tokens that encode fixed spatial priors. While effective for natural images, these priors may be suboptimal when spatial layout is only weakly informative, a frequent condition in medical imaging. We introduce ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer), a compact Vision Transformer that removes positional embeddings and the [CLS] token and achieves permutation-invariant patch processing via global average pooling. "Zero-token" denotes the removal of the dedicated aggregation token and the positional encodings; the patch tokens themselves remain unchanged. Adaptive residual projections preserve training stability under strict parameter constraints. We evaluate ZACH-ViT on seven MedMNIST datasets under a strict few-shot protocol (50 samples per class, fixed hyperparameters, five seeds). The results reveal regime-dependent behavior: ZACH-ViT (0.25M parameters, trained from scratch) achieves its strongest advantage on BloodMNIST and remains competitive on PathMNIST, while its relative advantage decreases on datasets with stronger anatomical priors (OCTMNIST, OrganAMNIST), consistent with our hypothesis. Component and pooling ablations show that positional support becomes mildly beneficial as spatial structure increases, whereas reintroducing a [CLS] token is consistently unfavorable. These findings support the view that aligning the architecture with the structure of the data can matter more than universal benchmark dominance. Despite its minimal size and lack of pretraining, ZACH-ViT achieves competitive performance under data-scarce conditions, which is relevant for compact medical imaging models and low-resource settings. Code: https://github.com/Bluesman79/ZACH-ViT
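The permutation-invariance claim can be checked numerically. What follows is a minimal NumPy sketch, not the authors' implementation: it assumes a single-head self-attention block with a plain residual connection and mean pooling, and it omits the adaptive residual projections, normalization, and MLP layers. All function names, weight matrices, and sizes are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over (num_patches, dim) tokens, no positional terms."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

def encode(tokens, w_q, w_k, w_v):
    """Residual attention block, then global average pooling over patches (no [CLS] token)."""
    hidden = tokens + self_attention(tokens, w_q, w_k, w_v)
    return hidden.mean(axis=0)

rng = np.random.default_rng(0)
num_patches, dim = 16, 8  # illustrative sizes, not the paper's configuration
tokens = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
perm = rng.permutation(num_patches)

# Without positional embeddings, shuffling the patches leaves the pooled vector unchanged.
pooled = encode(tokens, w_q, w_k, w_v)
pooled_shuffled = encode(tokens[perm], w_q, w_k, w_v)
is_invariant = np.allclose(pooled, pooled_shuffled)

# Adding position-specific embeddings (random here, as a stand-in) breaks the invariance.
pos = rng.normal(size=(num_patches, dim))
with_pos = encode(tokens + pos, w_q, w_k, w_v)
with_pos_shuffled = encode(tokens[perm] + pos, w_q, w_k, w_v)
pos_breaks_invariance = not np.allclose(with_pos, with_pos_shuffled)

print("invariant without positions:", is_invariant)
print("positions break invariance:", pos_breaks_invariance)
```

Self-attention without positional terms is permutation-equivariant, and averaging over patches is permutation-invariant, so their composition does not depend on patch order. This is the property the abstract attributes to removing both the positional embeddings and the [CLS] token.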
