Extensive work has demonstrated the effectiveness of Vision Transformers. For dense prediction tasks, the plain Vision Transformer typically obtains multi-scale features from a fixed set of layers or from the last layer alone. However, this selection is usually made by hand, and different samples often exhibit different features (e.g., edges, structure, texture, detail) at different layers. This calls for a dynamic, adaptive fusion method that filters features across layers. In this paper, unlike previous work that focuses on the encoder or decoder, we design a neck network for adaptive fusion and feature selection, called ViTController. We validate the effectiveness of our method on different datasets and models, where it surpasses previous state-of-the-art methods. Finally, our method can also be used as a plug-in module and inserted into different networks.
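To make the idea of sample-dependent fusion across layers concrete, the following is a minimal sketch and not the ViTController architecture itself, which the abstract does not specify. It assumes each layer's features have already been pooled to a single vector, and it uses a hypothetical learned scoring vector `gate_weights` with a softmax over layers. The function names are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_layer_fusion(layer_features, gate_weights):
    """Fuse per-layer feature vectors with weights computed per sample.

    layer_features: list of L vectors (each a list of C floats), one per layer.
    gate_weights:   list of C floats; a hypothetical learned scoring vector
                    (an assumption for illustration, not from the paper).
    Returns (fused_vector, layer_weights).
    """
    # Score each layer from its own features, so the weighting depends on the sample.
    scores = [
        sum(f * g for f, g in zip(feat, gate_weights))
        for feat in layer_features
    ]
    # Normalize the scores into weights that sum to 1 across layers.
    weights = softmax(scores)
    # Weighted sum over layers, computed channel by channel.
    channels = len(layer_features[0])
    fused = [
        sum(w * feat[c] for w, feat in zip(weights, layer_features))
        for c in range(channels)
    ]
    return fused, weights


# Toy input: three layers, two channels each.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused, weights = adaptive_layer_fusion(features, gate_weights=[1.0, 0.0])
```

In contrast to choosing a fixed set of layers by hand, the layer weights here change with the input features. A real neck network would operate on spatial token maps and learn its gating parameters end to end.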