Vision Transformers (ViTs) have demonstrated strong performance across a range of computer vision tasks by modeling long-range spatial interactions via self-attention. However, channel-wise mixing in ViTs remains static: it relies on fixed multilayer perceptrons (MLPs) whose weights do not adapt to the input content. We introduce CAViT, a dual-attention architecture that replaces the static MLP with a dynamic, attention-based mechanism for feature interaction. Each Transformer block in CAViT performs spatial self-attention followed by channel-wise self-attention, allowing the model to dynamically recalibrate feature representations based on global image context. This unified, content-aware token mixing strategy enhances representational expressiveness without increasing depth or complexity. We validate CAViT on five benchmark datasets spanning both natural and medical domains, where it outperforms the standard ViT baseline by up to 3.6% in accuracy while reducing parameter count and FLOPs by over 30%. Qualitative attention maps show sharper, more semantically meaningful activation patterns, supporting the effectiveness of our attention-driven token mixing.
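The block structure described in the abstract, spatial self-attention followed by channel-wise self-attention, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the exact formulation, so the single-head design, the pre-norm residual layout, the scaling of the channel scores by the square root of the token count, and all names (`spatial_attention`, `channel_attention`, `dual_attention_block`, `init_params`) are assumptions made for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel dimension (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def spatial_attention(x, wq, wk, wv):
    # x: (N, C) with N tokens and C channels; tokens attend to tokens.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (N, N)
    return softmax(scores, axis=-1) @ v          # (N, C)

def channel_attention(x, wq, wk, wv):
    # Assumed formulation: channels attend to channels, so the
    # attention matrix is (C, C) and depends on the input content.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q.T @ k / np.sqrt(x.shape[0])       # (C, C)
    weights = softmax(scores, axis=-1)           # each row sums to 1
    return v @ weights.T                         # (N, C)

def init_params(rng, channels):
    # Hypothetical helper: random projection matrices for both stages.
    def proj():
        return rng.normal(size=(channels, channels)) / np.sqrt(channels)
    return {
        "spatial": (proj(), proj(), proj()),
        "channel": (proj(), proj(), proj()),
    }

def dual_attention_block(x, params):
    # Spatial mixing first, then channel mixing in place of the MLP,
    # each wrapped in a residual connection (assumed pre-norm layout).
    x = x + spatial_attention(layer_norm(x), *params["spatial"])
    x = x + channel_attention(layer_norm(x), *params["channel"])
    return x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))                # 16 tokens, 8 channels
out = dual_attention_block(tokens, init_params(rng, 8))
print(out.shape)                                 # (16, 8)
```

In this sketch the channel stage uses three C×C projections, whereas a standard ViT MLP with expansion ratio 4 uses two projections of size C×4C. That difference is one plausible source of a parameter reduction, though the abstract does not say how the reported savings are obtained.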