Modality differences have driven the development of heterogeneous architectures for vision and language models: images typically require 2D non-causal modeling, whereas text uses 1D causal modeling. This distinction poses a significant challenge for building unified multi-modal models. This paper explores the feasibility of representing images with 1D causal modeling. We identify an "over-focus" issue in existing 1D causal vision models, in which attention concentrates excessively on a small proportion of visual tokens. This over-focus hinders the model's ability to extract diverse visual features and to receive effective gradients for optimization. To address it, we propose De-focus Attention Networks, which employ learnable band-pass filters to create varied attention patterns. During training, we further introduce large, scheduled drop-path rates and, for global understanding tasks, an auxiliary loss on globally pooled features. Together, these strategies encourage the model to attend to a broader range of tokens and improve network optimization. Extensive experiments validate our approach, demonstrating that 1D causal visual representation can perform comparably to 2D non-causal representation on tasks such as global perception, dense prediction, and multi-modal understanding. Code is released at https://github.com/OpenGVLab/De-focus-Attention-Networks.
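To make the core idea concrete, the following is a minimal numpy sketch of causal attention whose scores are modulated by a band-pass filter over relative token distance, so that attention mass is pushed away from a few dominant tokens. This is an illustrative assumption, not the authors' implementation: the function name, the difference-of-exponentials filter, and the fixed (rather than learnable) rates `lam_lo`/`lam_hi` are all hypothetical simplifications.

```python
import numpy as np

def defocus_causal_attention(q, k, v, lam_lo, lam_hi):
    """Single-head causal attention with a band-pass distance gate.

    Hypothetical sketch: a difference of two exponential decays
    (lam_hi > lam_lo) forms a band-pass profile over the relative
    distance i - j, peaking at mid-range tokens instead of letting
    attention collapse onto a few nearby ones.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                 # standard dot-product logits
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    dist = i - j                                  # >= 0 below the diagonal
    # Band-pass gate: small at distance 0, peaks at intermediate distance.
    gate = np.exp(-lam_lo * dist) - np.exp(-lam_hi * dist)
    causal = j <= i
    # Apply the gate as an additive log-bias; mask out future tokens.
    scores = np.where(causal, scores + np.log(gate + 1e-6), -np.inf)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)            # row-wise softmax
    return w @ v

# Example usage: 5 tokens with 4-dim features.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((5, 4)) for _ in range(3))
out = defocus_causal_attention(q, k, v, lam_lo=0.1, lam_hi=1.0)
```

In a full model, per-head learnable filter parameters would yield varied attention patterns across heads, which is the "varied attention patterns" role the abstract assigns to the band-pass filters; the scheduled drop-path and auxiliary pooled-feature loss act on the network as a whole rather than inside this attention operator.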