Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
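The claim that interpolation can move from feature space to logit space without changing the linear scoring follows from linearity: a dot product with a query commutes with any linear upsampling operator. The NumPy sketch below illustrates this under simplifying assumptions that are not taken from the paper: the upsampling is plain bilinear interpolation with no learned or nonlinear layers, and all names (`bilinear_matrix`, `masks_feature_space`, `masks_token_space`) and tensor sizes are hypothetical. It is an illustration of the identity, not the authors' implementation.

```python
import numpy as np


def bilinear_matrix(n_in: int, n_out: int) -> np.ndarray:
    """(n_out, n_in) matrix for 1-D linear interpolation with endpoints aligned."""
    M = np.zeros((n_out, n_in))
    if n_in == 1:
        M[:, 0] = 1.0
        return M
    pos = np.linspace(0.0, n_in - 1, n_out)          # sample positions on the input grid
    lo = np.minimum(np.floor(pos).astype(int), n_in - 2)
    frac = pos - lo
    rows = np.arange(n_out)
    M[rows, lo] = 1.0 - frac
    M[rows, lo + 1] += frac
    return M


def upsample(x: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Separable bilinear upsampling of the last two axes: (..., h, w) -> (..., H, W)."""
    return A @ x @ B.T


def masks_feature_space(Q, tokens, grid_hw, A, B):
    """Baseline pattern: rebuild a dense feature map, upsample it, then score."""
    h, w = grid_hw
    feat = tokens.T.reshape(-1, h, w)                # (C, h, w) image-space features
    feat_up = upsample(feat, A, B)                   # (C, H, W) dense intermediate
    return np.einsum("qc,chw->qhw", Q, feat_up)      # (Nq, H, W) mask logits


def masks_token_space(Q, tokens, grid_hw, A, B):
    """Token-space pattern: score queries against tokens, then upsample the logits."""
    h, w = grid_hw
    low = (Q @ tokens.T).reshape(Q.shape[0], h, w)   # (Nq, h, w) query-token affinities
    return upsample(low, A, B)                       # (Nq, H, W) mask logits


# Hypothetical sizes: 5 queries, 16 channels, a 6x8 token grid upsampled 4x.
rng = np.random.default_rng(0)
grid_hw = (6, 8)
Q = rng.standard_normal((5, 16))
tokens = rng.standard_normal((grid_hw[0] * grid_hw[1], 16))
A = bilinear_matrix(grid_hw[0], 24)
B = bilinear_matrix(grid_hw[1], 32)

logits_a = masks_feature_space(Q, tokens, grid_hw, A, B)
logits_b = masks_token_space(Q, tokens, grid_hw, A, B)
max_diff = float(np.max(np.abs(logits_a - logits_b)))
print("max abs difference:", max_diff)
```

In this sketch both paths produce the same logits up to floating-point rounding, but the token-space path never materializes the upsampled feature tensor of shape (C, H, W); it interpolates only the (Nq, h, w) affinity maps. Whether that translates into savings in a real model depends on the number of queries relative to the channel width and on any nonlinear layers in the original mask head, which this example deliberately omits.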