Colonoscopy 3D Video Dataset with Paired Depth from 2D-3D Registration

Screening colonoscopy is an important clinical application for several 3D computer vision techniques, including depth estimation, surface reconstruction, and missing region detection. However, the development, evaluation, and comparison of these techniques on real colonoscopy videos remain largely qualitative because ground truth data are difficult to acquire. In this work, we present a Colonoscopy 3D Video Dataset (C3VD), acquired with a high-definition clinical colonoscope and high-fidelity colon models, for benchmarking computer vision methods in colonoscopy. We introduce a novel multimodal 2D-3D registration technique to register optical video sequences with ground truth rendered views of a known 3D model. The two modalities are registered by transforming optical images to depth maps with a Generative Adversarial Network and then aligning edge features with an evolutionary optimizer. This registration method achieves an average translation error of 0.321 millimeters and an average rotation error of 0.159 degrees in simulation experiments where error-free ground truth is available. The method also leverages video information, improving registration accuracy by 55.6% for translation and 60.4% for rotation compared to single-frame registration. Twenty-two short video sequences were registered to generate 10,015 total frames with paired ground truth depth, surface normals, optical flow, occlusion, six degree-of-freedom pose, coverage maps, and 3D models. The dataset also includes screening videos acquired by a gastroenterologist with paired ground truth pose and 3D surface models. The dataset and registration source code are available at durr.jhu.edu/C3VD.
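
The translation and rotation errors quoted above compare an estimated camera pose with a ground truth pose. The abstract does not state the exact error definitions used, so the following is only a minimal sketch under a common convention: translation error as the Euclidean distance between the two camera positions, and rotation error as the geodesic angle between the two rotation matrices. The function names and the `rot_z` helper are hypothetical and are not taken from the C3VD source code.

```python
import math


def rot_z(deg):
    """Return a 3x3 rotation about the z-axis by `deg` degrees (demo helper)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def translation_error(t_est, t_gt):
    """Euclidean distance between two 3-vectors, in the units of the inputs."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t_est, t_gt)))


def rotation_error_deg(R_est, R_gt):
    """Geodesic angle, in degrees, between two 3x3 rotation matrices."""
    # trace(R_est^T @ R_gt) equals the sum of elementwise products.
    trace = sum(R_est[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point round-off before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


if __name__ == "__main__":
    # Illustrative poses only; these are not values from the dataset.
    print(translation_error((0.3, 0.4, 0.0), (0.0, 0.0, 0.0)))
    print(rotation_error_deg(rot_z(10.5), rot_z(10.0)))
```

Averaging these two quantities over all registered frames would give per-sequence figures comparable in form to the millimeter and degree values reported above, assuming the positions are expressed in millimeters.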
