Foundation models have become prominent in computer vision, achieving notable success across a wide range of tasks. Their effectiveness, however, depends largely on pre-training with extensive datasets, and training such models from scratch on the small datasets typical of capsule endoscopy imaging is impractical. Pre-training on broad, general-purpose vision datasets is therefore crucial for fine-tuning these models to specific tasks. In this work, we introduce a simple approach that adapts foundation models with the low-rank adaptation (LoRA) technique for easier customization. Building on the DINOv2 foundation model, our method applies low-rank adaptation to tailor the pretrained encoder to capsule endoscopy diagnosis. Unlike conventional fine-tuning, our strategy inserts LoRA layers designed to absorb specialized surgical domain knowledge. During training, the backbone encoder remains frozen, and only the LoRA layers and the disease-classification head are optimized. We evaluated our method on two publicly available capsule endoscopy disease-classification datasets, achieving 97.75% accuracy on the Kvasir-Capsule dataset and 98.81% on the Kvasirv2 dataset. These results demonstrate that foundation models can be effectively adapted to capsule endoscopy diagnosis, and that relying solely on straightforward fine-tuning or on pretrained models from general computer vision tasks is inadequate for such specialized applications.
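The training recipe described above (frozen backbone encoder, trainable LoRA layers plus a classification head) can be sketched in PyTorch. This is a minimal illustration under stated assumptions: the `LoRALinear` wrapper, the rank `r=4`, the toy MLP standing in for the DINOv2 encoder, and the 14-way head are all hypothetical choices, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank residual: W x + s * B A x."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # keep pretrained weights fixed
        # A is small random, B starts at zero, so the wrapped layer
        # initially computes exactly the same function as the base layer.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Stand-in "backbone encoder" (a real setup would wrap DINOv2's attention
# and MLP projections instead of this toy two-layer MLP).
backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
for p in backbone.parameters():
    p.requires_grad = False                         # freeze the whole backbone

# Inject LoRA layers into the backbone's linear projections.
backbone[0] = LoRALinear(backbone[0])
backbone[2] = LoRALinear(backbone[2])

# Trainable disease-classification head (14 classes is an example count).
head = nn.Linear(64, 14)

# Only the LoRA matrices and the head reach the optimizer.
trainable = [p for p in backbone.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable + list(head.parameters()), lr=1e-4)

logits = head(backbone(torch.randn(2, 32)))         # shape: (2, 14)
```

Because `B` is zero-initialized, adaptation starts from the pretrained model's behavior and the low-rank update is learned on top, which is what makes freezing the backbone safe.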