Vision Transformers naturally accommodate sparsity, yet standard tokenization methods confine features to discrete patch grids. This constraint prevents models from fully exploiting sparse regimes, forcing awkward compromises. We propose Subpixel Placement of Tokens (SPoT), a novel tokenization strategy that positions tokens continuously within images, effectively sidestepping grid-based limitations. With our proposed oracle-guided search, we uncover substantial performance gains achievable with ideal subpixel token positioning, drastically reducing the number of tokens necessary for accurate predictions during inference. SPoT provides a new direction for flexible, efficient, and interpretable ViT architectures, redefining sparsity as a strategic advantage rather than an imposed limitation.
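To make the core idea concrete, here is a minimal sketch of extracting a token embedding at a continuous (sub-pixel) image position via bilinear interpolation, rather than from a fixed patch grid. This is an illustrative assumption, not the SPoT paper's actual implementation; the function name `sample_subpixel` and the raw-pixel feature map are hypothetical stand-ins.

```python
import numpy as np

def sample_subpixel(image: np.ndarray, y: float, x: float) -> np.ndarray:
    """Bilinearly interpolate features of an (H, W, C) image at a
    continuous position (y, x) — any float coordinate is a valid
    token location, so tokens are not confined to grid cells."""
    h, w, _ = image.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Weighted blend of the four surrounding pixel features.
    top = (1 - dx) * image[y0, x0] + dx * image[y0, x1]
    bot = (1 - dx) * image[y1, x0] + dx * image[y1, x1]
    return (1 - dy) * top + dy * bot

# Toy example: a 4x4 single-channel image with values 0..15.
img = np.arange(16, dtype=float).reshape(4, 4, 1)
tok = sample_subpixel(img, 1.5, 2.5)  # midpoint of four pixel centers
```

In a full model, such continuously placed samples (or small interpolated patches) would be projected into token embeddings and paired with continuous positional encodings, letting a search procedure place tokens wherever they are most informative.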