Most computer vision models are built on either convolutional neural networks (CNNs) or transformers: the former capture local features, while the latter capture global features. To relieve the performance limitations that arise when a model lacks either global or local features, we develop CECT, a novel classification network formed as a controllable ensemble of a CNN and a transformer. CECT is composed of a convolutional encoder block, a transposed-convolutional decoder block, and a transformer classification block. Unlike conventional CNN- or transformer-based methods, CECT captures features at both multiple local scales and the global scale. In addition, the contribution of local features at each scale can be controlled through the proposed ensemble coefficients. We evaluate CECT on two public COVID-19 datasets, where it outperforms existing state-of-the-art methods on all evaluation metrics. Given its remarkable feature capture ability, we believe CECT can be extended to other medical image classification scenarios as a diagnosis assistant.
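The abstract does not spell out how the ensemble coefficients enter the network, so the following is only a minimal sketch of one plausible reading: feature maps from different local scales are brought to a common resolution and summed with per-scale weights. The function names `upsample_nearest` and `weighted_ensemble`, the toy feature maps, and the coefficient values are all hypothetical. Nearest-neighbour upsampling stands in for the transposed-convolutional decoder, whose learned kernels are not described here.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def upsample_nearest(fmap: Matrix, factor: int) -> Matrix:
    """Repeat every value `factor` times along both axes.

    Assumption: a fixed nearest-neighbour upsampling used as a stand-in
    for a learned transposed convolution.
    """
    out: Matrix = []
    for row in fmap:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def weighted_ensemble(fmaps: Sequence[Matrix], coeffs: Sequence[float]) -> Matrix:
    """Return sum_i coeffs[i] * fmaps[i] for feature maps of equal shape.

    Assumption: the ensemble coefficients act as scalar weights, one per scale.
    """
    if len(fmaps) != len(coeffs):
        raise ValueError("need exactly one coefficient per feature map")
    height, width = len(fmaps[0]), len(fmaps[0][0])
    for fmap in fmaps:
        if len(fmap) != height or any(len(row) != width for row in fmap):
            raise ValueError("all feature maps must share the same shape")
    return [
        [sum(c * f[i][j] for f, c in zip(fmaps, coeffs)) for j in range(width)]
        for i in range(height)
    ]


# Toy feature maps at two local scales (hypothetical values).
fine = [[1.0, 2.0], [3.0, 4.0]]   # higher-resolution scale
coarse = [[10.0]]                 # lower-resolution scale

coarse_up = upsample_nearest(coarse, 2)  # bring to the 2x2 resolution
fused = weighted_ensemble([fine, coarse_up], [1.0, 0.5])
# fused == [[6.0, 7.0], [8.0, 9.0]]
```

In this sketch, setting a coefficient to 0 removes that scale from the fused map entirely, which is the kind of control over per-scale contributions that the abstract refers to. In the full model, the fused features would then be passed to the transformer classification block.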