Surgical scene perception from video is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets for surgical workflow analysis are typically small in scale, cover few surgery and phase categories, and lack time-localized annotations, so they fall short of what is needed for action understanding and for validating model generalization in complex and diverse real-world surgical scenarios. To address this gap, we introduce OphNet, a large-scale, expert-annotated video benchmark for ophthalmic surgical workflow understanding. OphNet features: 1) a diverse collection of 2,278 surgical videos spanning 66 types of cataract, glaucoma, and corneal surgeries, with detailed annotations for 102 unique surgical phases and 150 granular operations; 2) sequential and hierarchical annotations for each surgery, phase, and operation, enabling comprehensive understanding and improved interpretability; and 3) time-localized annotations, facilitating temporal localization and prediction tasks within surgical workflows. With approximately 205 hours of surgical video, OphNet is about 20 times larger than the largest existing surgical workflow analysis benchmark. Our dataset and code are available at: \url{https://github.com/minghu0830/OphNet-benchmark}.