Compared with the rapid progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still at an early stage. This work presents a new large-scale CNN-based foundation model, termed InternImage, which, like ViTs, benefits from increased parameters and training data. Unlike recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as its core operator, so that the model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also performs adaptive spatial aggregation conditioned on input and task information. As a result, InternImage relaxes the strict inductive bias of traditional CNNs and can, like ViTs, learn stronger and more robust patterns with large-scale parameters from massive data. The effectiveness of our model is demonstrated on challenging benchmarks including ImageNet, COCO, and ADE20K. Notably, InternImage-H achieves a new record of 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. The code will be released at https://github.com/OpenGVLab/InternImage.
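To make the idea of adaptive spatial aggregation concrete, the following is a minimal sketch of a generic modulated deformable convolution evaluated at a single output location. It is not the authors' operator or implementation. The names `bilinear_sample` and `deformable_conv_at` are hypothetical, the input is a single-channel image, and the per-tap offsets and modulation scalars are passed in directly, whereas a real layer would predict them from the input features.

```python
import math


def bilinear_sample(image, y, x):
    """Sample a 2D image (list of rows) at fractional (y, x).

    Positions outside the image contribute zero (zero padding).
    """
    height, width = len(image), len(image[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    # Blend the four integer neighbours around (y, x).
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < height and 0 <= xi < width:
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * image[yi][xi]
    return value


def deformable_conv_at(image, weights, offsets, modulation, cy, cx):
    """Evaluate a 3x3 modulated deformable convolution at (cy, cx).

    weights:    9 kernel weights, row-major over the 3x3 grid
    offsets:    9 (dy, dx) shifts added to the regular grid positions
    modulation: 9 scalars that rescale each sampled value
    """
    grid = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    out = 0.0
    for k, (dy, dx) in enumerate(grid):
        off_y, off_x = offsets[k]
        sample = bilinear_sample(image, cy + dy + off_y, cx + dx + off_x)
        out += weights[k] * modulation[k] * sample
    return out


# Toy input: a 5x5 ramp image with value 5*row + col (an assumption for illustration).
image = [[5.0 * r + c for c in range(5)] for r in range(5)]
mean_kernel = [1.0 / 9.0] * 9
ones = [1.0] * 9

# Zero offsets reduce the operator to an ordinary 3x3 convolution.
regular = deformable_conv_at(image, mean_kernel, [(0.0, 0.0)] * 9, ones, 2, 2)

# Shifting every tap half a pixel to the right moves the sampling locations.
shifted = deformable_conv_at(image, mean_kernel, [(0.0, 0.5)] * 9, ones, 2, 2)
```

With zero offsets and unit modulation the result equals a regular 3x3 convolution. Non-zero offsets move the sampling points off the fixed grid, which is what lets the receptive field adapt to the input.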