Vision Transformers (ViTs) demonstrate remarkable performance in image classification through visual-token interaction learning, particularly when equipped with local information via region attention or convolutions. Although such architectures improve feature aggregation across granularities, they often fail to improve the robustness of the network. Neural Cellular Automata (NCA) enable the modeling of global visual-token representations through local interactions, with training strategies and architecture design that confer strong generalization ability and robustness to noisy input. In this paper, we propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformers, which uses NCA as plug-and-play adaptors between ViT layers, enhancing ViT performance and robustness against adversarial samples as well as out-of-distribution inputs. To overcome the large computational overhead of standard NCAs, we propose Dynamic Interaction for more efficient interaction learning. Building on our analysis of how AdaNCA placement affects robustness, we also develop an algorithm for identifying the most effective insertion points for AdaNCA. With less than a 3% increase in parameters, AdaNCA contributes more than a 10% absolute improvement in accuracy under adversarial attacks on the ImageNet1K benchmark. Moreover, through extensive evaluations across eight robustness benchmarks and four ViT architectures, we demonstrate that AdaNCA, as a plug-and-play module, consistently improves the robustness of ViTs.
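To make the adaptor idea concrete, the following is a minimal sketch of a single NCA update over a 2D grid of ViT tokens, the kind of step AdaNCA would insert between transformer layers. It is not the paper's implementation: the perception function (here a simple 4-neighbour mean), the 2-layer MLP residual, and all names (`nca_step`, `update_prob`) are illustrative assumptions; only the overall pattern of local perception plus stochastic residual updates reflects standard NCA design.

```python
import numpy as np

def nca_step(tokens, w1, b1, w2, rng, update_prob=0.5):
    """One hedged NCA update over a (H, W, C) token grid.

    Perception: each cell sees its own state concatenated with the mean
    of its 4 neighbours (a simplified stand-in for a learned perception).
    Update: a 2-layer MLP produces a residual, applied only to a random
    subset of cells (the stochastic updates typical of NCA training).
    """
    H, W, C = tokens.shape
    # 4-neighbour mean with edge padding (simplified local perception)
    p = np.pad(tokens, ((1, 1), (1, 1), (0, 0)), mode="edge")
    neigh = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0
    perception = np.concatenate([tokens, neigh], axis=-1)   # (H, W, 2C)
    # 2-layer MLP computing the residual update
    h = np.maximum(perception @ w1 + b1, 0.0)               # ReLU hidden
    delta = h @ w2                                          # (H, W, C)
    # stochastic cell mask: only a fraction of cells update this step
    mask = (rng.random((H, W, 1)) < update_prob).astype(tokens.dtype)
    return tokens + mask * delta
```

Because the update is residual and shape-preserving, several such steps can be chained between two ViT blocks without changing the token grid's dimensions, which is what makes the module plug-and-play.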