Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes

The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines (ABMIL, Minimal-ViT, and TransMIL) on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank both on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, achieving the best mean rank under FGSM (2.00) and the second-best under PGD (2.29), where ABMIL ranks first. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.
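The mean-rank figures quoted above can be read as follows: on each dataset the four models are ranked by their score (1 = best), and the ranks are then averaged over the datasets. The sketch below is a minimal illustration of that aggregation, not the authors' evaluation code. The function names, the average-rank handling of ties, and the placeholder scores are all assumptions made for illustration; the abstract does not state which metric is ranked or how ties and seeds are handled.

```python
from statistics import mean


def rank_models(scores):
    """Rank models on one dataset: 1 = highest score.

    Tied models share the average of the ranks they span
    (an assumption; the abstract does not specify tie handling).
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every model tied with the one at position i.
        j = i
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k]] = shared
        i = j + 1
    return ranks


def mean_ranks(per_dataset_scores):
    """Average each model's per-dataset rank over all datasets."""
    per_dataset_ranks = [rank_models(s) for s in per_dataset_scores.values()]
    models = per_dataset_ranks[0].keys()
    return {m: mean(r[m] for r in per_dataset_ranks) for m in models}


# Placeholder scores for two hypothetical datasets; NOT results from the paper.
toy_scores = {
    "dataset_a": {"ZACH-ViT": 0.80, "ABMIL": 0.75, "Minimal-ViT": 0.70, "TransMIL": 0.65},
    "dataset_b": {"ZACH-ViT": 0.60, "ABMIL": 0.62, "Minimal-ViT": 0.55, "TransMIL": 0.50},
}

print(mean_ranks(toy_scores))
```

Under this reading, a mean rank over seven datasets is a rank sum divided by seven, so 1.57 would correspond to a rank sum of 11, 2.00 to 14, and 2.29 to 16.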
