Deep learning can predict depth maps and capsule ego-motion from capsule endoscopy videos, aiding 3D scene reconstruction and lesion localization. However, collisions of the capsule endoscope within the gastrointestinal tract introduce vibration perturbations into the training data. Existing solutions focus solely on vision-based processing, neglecting auxiliary signals such as vibration that could reduce noise and improve performance. We therefore propose V$^2$-SfMLearner, a multimodal approach that integrates vibration signals into vision-based depth and capsule-motion estimation for monocular capsule endoscopy. We construct a multimodal capsule endoscopy dataset containing vibration and visual signals, and develop an unsupervised method that exploits vision-vibration signals, effectively eliminating vibration perturbations through multimodal learning. Specifically, we carefully design a vibration network branch and a Fourier fusion module to detect and mitigate vibration noise. The fusion framework is compatible with popular vision-only algorithms. Extensive validation on the multimodal dataset demonstrates superior performance and robustness over vision-only algorithms. Without requiring large external equipment, V$^2$-SfMLearner has the potential to be integrated into clinical capsule robots, providing a real-time and dependable digestive-examination tool. The findings show promise for practical implementation in clinical settings, enhancing doctors' diagnostic capabilities.
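To make the idea of Fourier-domain fusion concrete, the sketch below shows one plausible way a vibration embedding could modulate the spectrum of a visual feature map. The abstract does not specify the module's actual architecture; the class name `FourierFusion`, the tensor shapes, and the per-channel spectral gating are all illustrative assumptions, written in PyTorch.

```python
# Hypothetical sketch of Fourier-domain vision-vibration fusion.
# The real V^2-SfMLearner module design is not given in the abstract;
# shapes, names, and the gating scheme here are assumptions.
import torch
import torch.nn as nn


class FourierFusion(nn.Module):
    """Fuse a visual feature map with a vibration embedding in the
    frequency domain, then transform back to the spatial domain."""

    def __init__(self, channels: int, vib_dim: int):
        super().__init__()
        # Map the 1-D vibration embedding to per-channel gains that
        # attenuate vibration-corrupted frequency content.
        self.to_gain = nn.Linear(vib_dim, channels)

    def forward(self, feat: torch.Tensor, vib: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) visual features; vib: (B, vib_dim) embedding.
        spec = torch.fft.rfft2(feat, norm="ortho")   # complex spectrum (B, C, H, W//2+1)
        gain = torch.sigmoid(self.to_gain(vib))      # per-channel gains in (0, 1)
        spec = spec * gain[:, :, None, None]         # scale each channel's spectrum
        # Inverse transform back to the original spatial resolution.
        return torch.fft.irfft2(spec, s=feat.shape[-2:], norm="ortho")


# Usage: fuse a batch of feature maps with vibration embeddings.
fusion = FourierFusion(channels=64, vib_dim=16)
feat = torch.randn(2, 64, 32, 32)
vib = torch.randn(2, 16)
out = fusion(feat, vib)  # same shape as feat
```

A frequency-domain design like this is one natural fit for vibration noise, since mechanical perturbations tend to concentrate in characteristic frequency bands that a learned gate can suppress.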