In the human visual system, top-down attention plays a crucial role in perception: the brain first performs an overall but rough scene analysis to extract salient cues (i.e., overview first), followed by a finer-grained examination to make more accurate judgments (i.e., look closely next). However, recent efforts in ConvNet design have primarily focused on increasing kernel size to obtain a larger receptive field, without considering this crucial biomimetic mechanism as a way to further improve performance. To this end, we propose a novel pure ConvNet vision backbone, termed OverLoCK, which is carefully devised from both the architecture and mixer perspectives. Specifically, we introduce a biomimetic Deep-stage Decomposition Strategy (DDS) that fuses semantically meaningful context representations into middle and deep layers by providing dynamic top-down context guidance at both the feature and kernel-weight levels. To fully unleash the power of top-down context guidance, we further propose a novel \textbf{Cont}ext-\textbf{Mix}ing Dynamic Convolution (ContMix) that effectively models long-range dependencies while preserving inherent local inductive biases even as the input resolution increases; these properties are absent from previous convolutions. With the support of both DDS and ContMix, our OverLoCK exhibits notable performance improvements over existing methods. For instance, OverLoCK-T achieves a Top-1 accuracy of 84.2\% on ImageNet classification, significantly surpassing ConvNeXt-B while using only around one-third of its FLOPs/parameters. On object detection with Cascade Mask R-CNN, our OverLoCK-S surpasses MogaNet-B by a significant 1\% in AP$^b$. On semantic segmentation with UperNet, our OverLoCK-T remarkably improves over UniRepLKNet-T by 1.7\% in mIoU. Code is publicly available at https://github.com/LMMMEng/OverLoCK.