Convolutional neural networks (CNNs) and transformers are two of the most widely used model families in computer vision. However, CNNs mainly capture local features, whereas transformers mainly capture global features. To address the performance limitation caused by this incomplete feature coverage, we develop CECT, a novel classification network built as a controllable ensemble of a CNN and a transformer. CECT is composed of a convolutional encoder block, a transposed-convolutional decoder block, and a transformer classification block. Unlike existing methods, CECT captures features at both multiple local scales and the global scale without additional components or tricks. Moreover, the contribution of local features at different scales can be controlled with the proposed ensemble coefficients. We evaluate CECT on two public COVID-19 datasets, where it outperforms existing state-of-the-art methods. Given its feature capture ability, we believe CECT can be extended to other medical image classification scenarios as a diagnosis assistant. Code is available at https://github.com/NUS-Tim/CECT.
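To make the role of the ensemble coefficients concrete, the following is a minimal sketch that assumes they act as scalar weights in an element-wise weighted sum of per-scale feature vectors. The abstract does not state the fusion rule, so this rule, the function name `fuse_scales`, and the toy feature values are all hypothetical illustrations rather than the CECT implementation; the actual method is in the linked repository.

```python
from typing import List, Sequence


def fuse_scales(features: Sequence[Sequence[float]],
                coefficients: Sequence[float]) -> List[float]:
    """Weighted element-wise sum of per-scale feature vectors.

    Assumption: each scale's features have already been brought to a
    common length (for example by a decoder), and one scalar coefficient
    controls how much that scale contributes to the fused result.
    """
    if not features:
        raise ValueError("need at least one scale")
    if len(features) != len(coefficients):
        raise ValueError("need exactly one coefficient per scale")
    length = len(features[0])
    if any(len(f) != length for f in features):
        raise ValueError("all feature vectors must share one length")

    fused = [0.0] * length
    for feat, coef in zip(features, coefficients):
        for i, value in enumerate(feat):
            fused[i] += coef * value
    return fused


# Toy feature vectors standing in for three local scales (made-up values).
fine = [1.0, 0.0, 2.0]
mid = [0.0, 1.0, 2.0]
coarse = [4.0, 4.0, 0.0]

# Equal coefficients: every scale contributes the same amount.
equal = fuse_scales([fine, mid, coarse], [1.0, 1.0, 1.0])

# Zeroing two coefficients keeps only the finest scale.
fine_only = fuse_scales([fine, mid, coarse], [1.0, 0.0, 0.0])

print(equal)
print(fine_only)
```

Under this assumed rule, raising or lowering a single coefficient changes only how strongly that scale is represented in the fused features passed on to a downstream classifier, which is one simple way to read "controllable" in the abstract.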