In this study, we introduce FilterViT, an enhanced variant of MobileViT that applies an attention-based mechanism for early-stage downsampling. Traditional QKV operations on high-resolution feature maps are computationally expensive because of the large number of tokens involved. To address this, we propose a filter attention mechanism in which a convolutional neural network (CNN) generates an importance mask that focuses attention on key image regions. Because the mask highlights the essential image areas, the method significantly reduces computational complexity while remaining interpretable. Experimental results show that FilterViT achieves substantial gains in both efficiency and accuracy over comparable models. We also introduce DropoutViT, a variant that selects pixels stochastically, further improving robustness.
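The filter-attention idea described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the scoring map is reduced to a 1x1 convolution (a linear projection), top-k selection stands in for the importance mask, and all names (`filter_attention`, `w_score`, `k_ratio`) are hypothetical, not the paper's exact implementation. The point it demonstrates is the complexity reduction: attention runs only among the k selected tokens, so the quadratic cost drops from O(n^2) to O(k^2).

```python
import numpy as np

def filter_attention(x, w_score, w_q, w_k, w_v, k_ratio=0.25):
    """Sketch of one filter-attention step (illustrative, single-head).

    x:       (n, c) array of n spatial tokens with c channels.
    w_score: (c, 1) scoring projection, standing in for the CNN mask branch.
    w_q/w_k/w_v: (c, c) query/key/value projections.
    Only the top k_ratio fraction of tokens attend to each other;
    the remaining tokens pass through unchanged.
    """
    n, c = x.shape
    scores = x @ w_score                      # per-token importance (1x1 conv)
    k = max(1, int(n * k_ratio))
    idx = np.argsort(scores.ravel())[-k:]     # indices of the k most important tokens
    sel = x[idx]                              # gather selected tokens

    q, kk, v = sel @ w_q, sel @ w_k, sel @ w_v
    att = q @ kk.T / np.sqrt(c)               # (k, k) attention, not (n, n)
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)    # row-wise softmax

    out = x.copy()
    out[idx] = att @ v                        # scatter attended tokens back
    return out
```

For a 16-token map with `k_ratio=0.25`, only 4 tokens enter the QKV computation, so the attention matrix is 4x4 rather than 16x16, while unselected tokens are returned untouched.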