Transformers and their variants have shown great potential for various vision tasks in recent years, including image classification, object detection, and segmentation. Meanwhile, recent studies also reveal that, with proper architectural design, convolutional networks (ConvNets) can achieve performance competitive with transformers. However, no prior method has explored using pure convolution to build a Transformer-style decoder module, which is essential for encoder-decoder architectures such as the Detection Transformer (DETR). To this end, in this paper we explore whether a query-based detection and segmentation framework can be built with ConvNets instead of sophisticated transformer architectures. We propose a novel mechanism dubbed InterConv that performs interaction between object queries and image features via convolutional layers. Equipped with the proposed InterConv, we build Detection ConvNet (DECO), which is composed of a backbone and a convolutional encoder-decoder architecture. We compare the proposed DECO against prior detectors on the challenging COCO benchmark. Despite its simplicity, DECO achieves competitive performance in terms of detection accuracy and running speed. Specifically, with ResNet-18 and ResNet-50 backbones, DECO achieves $40.5\%$ and $47.8\%$ AP at $66$ and $34$ FPS, respectively. The proposed method is also evaluated on the segment anything task, demonstrating similar performance and higher efficiency. We hope the proposed method brings another perspective for designing architectures for vision tasks. Code is available at https://github.com/xinghaochen/DECO and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/DECO.
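The abstract does not detail how InterConv couples object queries with image features. As a rough illustration only, the sketch below shows one plausible shape such a convolutional query-feature interaction could take: queries arranged on a small 2D grid are upsampled to the feature-map resolution, fused additively, mixed with a depthwise 3x3 filter, and pooled back to the query grid. The function name, shapes, and the fixed mean filter (standing in for a learned depthwise convolution) are all our assumptions, not the paper's actual design.

```python
import numpy as np

def interconv_cross_step(queries, features):
    """Hypothetical sketch of a conv-based query/feature interaction.

    queries:  (C, qh, qw) -- object queries laid out on a small 2D grid
    features: (C, H, W)   -- image feature map, H/W multiples of qh/qw
    Returns updated queries of shape (C, qh, qw).
    """
    C, qh, qw = queries.shape
    C2, H, W = features.shape
    assert C == C2 and H % qh == 0 and W % qw == 0
    sh, sw = H // qh, W // qw

    # Nearest-neighbour upsample of the query grid to the feature size.
    up = np.repeat(np.repeat(queries, sh, axis=1), sw, axis=2)

    # Additive fusion of queries and features.
    fused = up + features

    # Depthwise 3x3 mean filter (stand-in for a learned depthwise conv).
    pad = np.pad(fused, ((0, 0), (1, 1), (1, 1)), mode="edge")
    mixed = np.zeros_like(fused)
    for dy in range(3):
        for dx in range(3):
            mixed += pad[:, dy:dy + H, dx:dx + W]
    mixed /= 9.0

    # Average-pool back to the query grid (adaptive pooling).
    return mixed.reshape(C, qh, sh, qw, sw).mean(axis=(2, 4))
```

In a real model the mean filter would be a learned depthwise convolution and the fusion/pooling choices would be design decisions; the point of the sketch is only that query-feature interaction can be expressed with purely convolutional, attention-free operations.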