In this work for the Capsule Vision Challenge 2024, we addressed the task of multiclass anomaly classification in video capsule endoscopy (VCE) [1] with a range of deep learning models, from custom CNNs to advanced transformer architectures. The goal is to correctly classify diverse gastrointestinal disorders, which is critical for improving diagnostic efficiency in clinical settings. We started with a baseline CNN model and improved performance with ResNet [2] for stronger feature extraction, followed by a Vision Transformer (ViT) [3] to capture global dependencies. We further improved the results with a Multiscale Vision Transformer (MViT) [4] for hierarchical feature extraction, while the Dual Attention Vision Transformer (DaViT) [5] delivered the best results by combining spatial and channel attention mechanisms. Our best balanced accuracy on the validation set [6] was 0.8592, with a mean AUC of 0.9932. This methodology improved model accuracy across a wide range of criteria, substantially outperforming the other models we evaluated. Additionally, our team, Capsule Commandos, achieved a 7th-place ranking on the test set [7] with a mean AUC of 0.7314 and a balanced accuracy of 0.3235.
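The headline metric above, balanced accuracy, is the mean of per-class recalls, so rare anomaly classes weigh as much as common ones. A minimal sketch of that computation, using hypothetical labels rather than challenge data:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each class contributes equally,
    regardless of how many samples it has."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Toy 3-class example (hypothetical, not VCE data):
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 0, 1, 0]  # one class-2 frame misclassified
print(balanced_accuracy(y_true, y_pred))  # ≈ 0.833 (recalls 1.0, 1.0, 0.5)
```

The mean AUC reported alongside it is typically the one-vs-rest ROC AUC averaged over classes (e.g. `sklearn.metrics.roc_auc_score` with `multi_class="ovr", average="macro"`); the exact challenge definition is given in [1].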