In this work, we introduce SPFormer, a novel Vision Transformer enhanced by superpixel representation. Traditional Vision Transformers partition an image into fixed-size patches that do not adapt to its content; SPFormer instead employs superpixels that do. This approach divides the image into irregular, semantically coherent regions that capture intricate details, and it is applicable at both the initial and intermediate feature levels. SPFormer is trainable end-to-end and exhibits superior performance across various benchmarks. Notably, on the challenging ImageNet benchmark it improves over DeiT-T by 1.4% and over DeiT-S by 1.1%. A standout feature of SPFormer is its inherent explainability: the superpixel structure offers a window into the model's internal processes and makes its behavior easier to interpret. This structured representation also significantly improves SPFormer's robustness, particularly in challenging scenarios such as image rotations and occlusions.
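The contrast between fixed patches and content-adaptive regions can be illustrated with a toy sketch. This is not SPFormer's method: the model learns its superpixels end-to-end, whereas here the region labels are supplied by hand, and the function names `patch_tokens` and `superpixel_tokens` are hypothetical. The sketch only shows why a token pooled over a region that follows an object boundary stays "pure" while a fixed square patch can mix two regions.

```python
import numpy as np

def patch_tokens(image, patch):
    """ViT-style tokens: mean value of each fixed patch x patch square."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0
    # Split rows and columns into (block index, offset inside block).
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.mean(axis=(1, 3)).reshape(-1, c)

def superpixel_tokens(image, labels, num_regions):
    """One token per region: mean value of all pixels sharing a label."""
    c = image.shape[-1]
    flat_pixels = image.reshape(-1, c)
    flat_labels = labels.reshape(-1)
    sums = np.zeros((num_regions, c))
    np.add.at(sums, flat_labels, flat_pixels)   # scatter-add pixels per region
    counts = np.bincount(flat_labels, minlength=num_regions)
    return sums / np.maximum(counts, 1)[:, None]

# Toy 4x4 single-channel image: three dark columns, one bright column.
image = np.zeros((4, 4, 1))
image[:, 3, 0] = 1.0

# Hand-made region map (an assumption for illustration) that follows
# the boundary between the dark and bright areas.
labels = np.zeros((4, 4), dtype=int)
labels[:, 3] = 1

fixed = patch_tokens(image, patch=2)
adaptive = superpixel_tokens(image, labels, num_regions=2)
print(fixed.ravel())     # right-hand patches blend dark and bright pixels
print(adaptive.ravel())  # each region token is uniformly dark or bright
```

In this toy case the 2x2 patches on the right straddle the boundary and average the two intensities, while the two region tokens each cover a single homogeneous area.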