Foundation models have exhibited remarkable success in various applications, such as disease diagnosis and text report generation. To date, however, a foundation model for endoscopic video analysis is still lacking. In this paper, we propose Endo-FM, a foundation model developed specifically from massive endoscopic video data. First, we build a video transformer that captures both local and global long-range dependencies across the spatial and temporal dimensions. Second, we pre-train the transformer on global and local views in a self-supervised manner, aiming to make it robust to spatial-temporal variations and discriminative across different scenes. To develop the foundation model, we construct a large-scale endoscopy video dataset by combining 9 publicly available datasets with a privately collected dataset from the Baoshan Branch of Renji Hospital in Shanghai, China. In total, the dataset contains over 33K video clips with up to 5 million frames, encompassing various protocols, target organs, and disease types. The pre-trained Endo-FM can be easily adapted to a given downstream task by serving as the backbone and being fine-tuned. In experiments on 3 different types of downstream tasks (classification, segmentation, and detection), Endo-FM surpasses current state-of-the-art self-supervised pre-training and adapter-based transfer learning methods by a significant margin, for example VCL (by 3.1% F1 for classification, 4.8% Dice for segmentation, and 5.5% F1 for detection) and ST-Adapter (by 5.9% F1 for classification, 9.6% Dice for segmentation, and 9.9% F1 for detection). Code, datasets, and models are released at https://github.com/med-air/Endo-FM.
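To make the idea of global and local views concrete, the following is a minimal sketch of how spatial-temporal views of two scales might be sampled from a single clip. It is an illustration under stated assumptions, not the released Endo-FM pipeline: the function names `sample_view` and `make_views`, the number of views, and all frame counts and crop sizes are hypothetical values chosen for the example.

```python
import numpy as np


def sample_view(video, num_frames, crop_size, rng):
    """Sample one spatial-temporal view from a clip of shape (T, H, W, C).

    Picks `num_frames` distinct frames (kept in temporal order) and one
    random square crop that is shared by all selected frames.
    """
    t, h, w, _ = video.shape
    frame_idx = np.sort(rng.choice(t, size=num_frames, replace=False))
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return video[frame_idx, top:top + crop_size, left:left + crop_size]


def make_views(video, rng, n_global=2, n_local=8):
    """Build global views (more frames, larger crop) and local views
    (fewer frames, smaller crop). All sizes here are assumptions."""
    global_views = [sample_view(video, num_frames=8, crop_size=48, rng=rng)
                    for _ in range(n_global)]
    local_views = [sample_view(video, num_frames=4, crop_size=24, rng=rng)
                   for _ in range(n_local)]
    return global_views, local_views


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for an endoscopic clip: 16 frames of 64x64 RGB.
    clip = rng.random((16, 64, 64, 3), dtype=np.float32)
    g_views, l_views = make_views(clip, rng)
    print(len(g_views), g_views[0].shape)
    print(len(l_views), l_views[0].shape)
```

In a self-supervised setup of this kind, a model would typically be trained so that its predictions agree across the views drawn from the same clip; the training objective itself is not shown here.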