Revisiting Surgical Instrument Segmentation Without Human Intervention: A Graph Partitioning View

From arXiv. Accepted by the 32nd ACM International Conference on Multimedia (ACM MM 2024) Workshop on Multimedia Computing for Health and Medicine (MCHM).

Surgical instrument segmentation (SIS) on endoscopic images is a long-standing and essential task in computer-assisted interventions for minimally invasive surgery. Given the recent surge of deep learning methods and their data-hungry nature, training a neural predictive model on massive expert-curated annotations has become the dominant, off-the-shelf approach in the field. This, however, imposes a prohibitive burden on clinicians, who must prepare fine-grained pixel-wise labels for the collected surgical video frames. In this work, we propose an unsupervised method that reframes video frame segmentation as a graph partitioning problem, regarding image pixels as graph nodes, which differs significantly from previous efforts. A self-supervised pre-trained model is first leveraged as a feature extractor to capture high-level semantic features. Then, Laplacian matrices are computed from the features and eigendecomposed for graph partitioning. On the "deep" eigenvectors, a surgical video frame is meaningfully segmented into different modules such as tools and tissues, providing distinguishable semantic information like locations, classes, and relations. The segmentation problem can then be naturally tackled by applying clustering or thresholding to the eigenvectors. Extensive experiments are conducted on various datasets (e.g., EndoVis2017, EndoVis2018, UCL) for different clinical endpoints. Across all the challenging scenarios, our method demonstrates higher performance and robustness than unsupervised state-of-the-art (SOTA) methods. The code is released at https://github.com/MingyuShengSMY/GraphClusteringSIS.git.
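
The pipeline in the abstract (features, Laplacian, eigendecomposition, thresholding) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity affinity, the unnormalized Laplacian, the zero threshold, the function name `spectral_bipartition`, and the synthetic "tool" and "tissue" features are all assumptions made for illustration. In the paper, the features come from a self-supervised pre-trained model rather than random data.

```python
import numpy as np

def spectral_bipartition(features):
    """Split N feature vectors (N x C) into two groups with the Fiedler vector."""
    # Affinity: cosine similarity between nodes, negatives clipped (assumed choice).
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    W = np.clip(unit @ unit.T, 0.0, None)
    np.fill_diagonal(W, 0.0)  # no self-loops

    # Unnormalized graph Laplacian L = D - W, with D the diagonal degree matrix.
    L = np.diag(W.sum(axis=1)) - W

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(L)

    # The eigenvector of the second-smallest eigenvalue (Fiedler vector)
    # encodes the coarsest two-way partition; threshold it at zero.
    fiedler = eigvecs[:, 1]
    return (fiedler > 0).astype(int), eigvals

# Synthetic stand-in for the node features of one frame (hypothetical data):
# 20 "tool" nodes and 44 "tissue" nodes scattered around two feature centers.
rng = np.random.default_rng(0)
dim = 16
tool_center = np.zeros(dim)
tool_center[0] = 1.0
tissue_center = np.zeros(dim)
tissue_center[0] = 0.2
tissue_center[1] = 1.0
gt = np.array([0] * 20 + [1] * 44)
centers = np.where(gt[:, None] == 0, tool_center, tissue_center)
features = centers + 0.05 * rng.standard_normal((gt.size, dim))

labels, eigvals = spectral_bipartition(features)
# The sign of an eigenvector is arbitrary, so compare up to a label swap.
agreement = max(np.mean(labels == gt), np.mean(labels != gt))
print(f"agreement with synthetic ground truth: {agreement:.2f}")
```

Thresholding a single eigenvector gives a two-way split. For more than two segments, one option is to run a clustering algorithm such as k-means on the first few eigenvectors instead, in line with the "clustering or thresholding" step described above.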
