Recently, MLP-based vision backbones have achieved promising performance in several visual recognition tasks. However, existing MLP-based methods aggregate tokens directly with static weights, leaving adaptability to different images unaddressed. Moreover, recent research demonstrates that MLP-Transformer is effective at modeling long-range dependencies but ineffective at capturing the high frequencies that primarily carry local information, which hinders its application to downstream dense prediction tasks such as semantic segmentation. To address these challenges, we propose a content-adaptive yet computationally efficient structure, dubbed Dynamic Spectrum Mixer (DSM). DSM represents token interactions in the frequency domain by employing the Discrete Cosine Transform, which allows it to learn long-range spatial dependencies with log-linear complexity. Furthermore, a dynamic spectrum weight generation layer is proposed as a spectrum band selector, which emphasizes informative frequency bands while diminishing others. As a result, DSM can efficiently learn detailed features from visual input that contains both high- and low-frequency information. Extensive experiments show that DSM is a powerful and adaptable backbone for a range of visual recognition tasks. In particular, DSM outperforms previous transformer-based and MLP-based models on image classification, object detection, and semantic segmentation, achieving, for example, 83.8\% top-1 accuracy on ImageNet and 49.9\% mIoU on ADE20K.
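The frequency-domain token mixing described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not specify how the dynamic spectrum weights are generated, so the sigmoid gate below and the names `dct_matrix`, `dynamic_spectrum_mix`, `w_gen`, and `b_gen` are hypothetical choices made only for illustration. For readability the sketch applies the transform as a dense matrix product, which is quadratic in the number of tokens per axis; a fast DCT routine (for example `scipy.fft.dctn` with `norm="ortho"`) would be needed to reach the log-linear complexity the abstract refers to.

```python
import numpy as np


def dct_matrix(n):
    """Return the orthonormal DCT-II matrix C of shape (n, n), so C @ C.T == I."""
    k = np.arange(n)[:, None]  # frequency index (rows)
    i = np.arange(n)[None, :]  # spatial index (columns)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2.0)    # rescale the DC row to make C orthonormal
    return c


def dynamic_spectrum_mix(x, w_gen, b_gen):
    """Mix tokens in the DCT domain with content-dependent band weights.

    x:     (H, W, D) grid of tokens with D channels.
    w_gen: (D, D) and b_gen: (D,) are parameters of a HYPOTHETICAL gate
           (an assumption of this sketch, not taken from the paper).
    """
    h, w, _ = x.shape
    ch, cw = dct_matrix(h), dct_matrix(w)

    # Forward 2D DCT over the two spatial axes, independently per channel.
    spec = np.einsum("kh,hwd->kwd", ch, x)
    spec = np.einsum("lw,kwd->kld", cw, spec)

    # Assumed gate: a sigmoid of a linear projection of the spectrum itself,
    # so the per-band weights depend on the input content.
    gate = 1.0 / (1.0 + np.exp(-(spec @ w_gen + b_gen)))
    spec = spec * gate

    # Inverse 2D DCT; for an orthonormal matrix the inverse is the transpose.
    out = np.einsum("kh,kld->hld", ch, spec)
    out = np.einsum("lw,hld->hwd", cw, out)
    return out


rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 6, 8))          # H=4, W=6, D=8
w_gen = 0.1 * rng.standard_normal((8, 8))
b_gen = np.zeros(8)
mixed = dynamic_spectrum_mix(tokens, w_gen, b_gen)
print(mixed.shape)
```

Because every DCT coefficient depends on all spatial positions, reweighting the spectrum mixes information globally across tokens, while the input-dependent gate lets the emphasized frequency bands vary from image to image.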