CudaSIFT-SLAM: multiple-map visual SLAM for full procedure mapping in real human endoscopy

Monocular visual simultaneous localization and mapping (V-SLAM) is now an indispensable tool in mobile robotics and augmented reality, where it performs robustly. However, human colonoscopies pose formidable challenges such as occlusions, blur, light changes, lack of texture, deformation, water jets, and tool interaction, which result in very frequent tracking losses. ORB-SLAM3, the top-performing multiple-map V-SLAM system, is unable to recover from these losses by merging sub-maps or relocalizing the camera, due to the poor performance of its place recognition algorithm based on ORB features and the DBoW2 bag-of-words. We present CudaSIFT-SLAM, the first V-SLAM system able to process complete human colonoscopies in real time. To overcome the limitations of ORB-SLAM3, we use SIFT instead of ORB features and replace the DBoW2 direct index with the more computationally demanding brute-force matching, which allows the system to match images separated in time for relocalization and map merging. Real-time performance is achieved thanks to CudaSIFT, a GPU implementation of SIFT extraction and brute-force matching. We benchmark our system on the C3VD phantom colon dataset and on a full real colonoscopy from the Endomapper dataset, demonstrating its ability to merge sub-maps and relocalize in them, and obtaining significantly longer sub-maps. Our system successfully maps 88% of the frames in the C3VD dataset in real time. In a real screening colonoscopy, despite the much higher prevalence of occluded and blurred frames, the mapping coverage is 53% in carefully explored areas and 38% in the full sequence, a 70% improvement over ORB-SLAM3.
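To illustrate what brute-force matching means here, the following is a minimal CPU sketch in plain Python. It is not the CudaSIFT GPU implementation: the function names, the toy 2-D descriptors (real SIFT descriptors have 128 dimensions), and the ratio test used to reject ambiguous matches are all assumptions made for illustration, since the abstract does not say how matches are filtered.

```python
from math import inf
from typing import List, Sequence, Tuple

Descriptor = Sequence[float]


def squared_l2(a: Descriptor, b: Descriptor) -> float:
    """Squared Euclidean distance between two descriptors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_match(
    query: Sequence[Descriptor],
    train: Sequence[Descriptor],
    ratio: float = 0.8,  # hypothetical threshold, not taken from the paper
) -> List[Tuple[int, int]]:
    """Compare every query descriptor against every train descriptor.

    Returns (query_index, train_index) pairs whose nearest neighbour is
    clearly closer than the second nearest (a ratio test, assumed here).
    Cost is O(len(query) * len(train)), which is why a GPU helps.
    """
    matches = []
    for qi, q in enumerate(query):
        best_idx, best, second = -1, inf, inf
        for ti, t in enumerate(train):
            d = squared_l2(q, t)
            if d < best:
                second = best
                best, best_idx = d, ti
            elif d < second:
                second = d
        # Distances are squared, so the ratio is squared as well.
        if best_idx >= 0 and best < (ratio ** 2) * second:
            matches.append((qi, best_idx))
    return matches
```

Unlike a bag-of-words index, which only compares features that fall into the same vocabulary node, this exhaustive search considers every candidate pair. The price is quadratic cost in the number of features, which is the motivation the abstract gives for moving the matching to the GPU.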
