Graph-based holistic scene representations facilitate surgical workflow understanding and have recently demonstrated significant success. However, this task is often hindered by the limited availability of densely annotated surgical scene data. In this work, we introduce an end-to-end framework for the generation and optimization of surgical scene graphs for a downstream task. Our approach leverages the flexibility of graph-based spectral clustering and the generalization capability of foundation models to generate unsupervised scene graphs with learnable properties. We reinforce the initial spatial graph with sparse temporal connections, using local matches between consecutive frames, to predict temporally consistent clusters across a temporal neighborhood. By jointly optimizing the spatiotemporal relations and node features of the dynamic scene graph with the downstream task of phase segmentation, we address the costly and annotation-intensive task of semantic scene comprehension and scene graph generation in surgical videos using only weak surgical phase labels. Further, by incorporating effective intermediate scene representation disentanglement steps within the pipeline, our solution outperforms the state of the art (SOTA) on the CATARACTS dataset by 8% in accuracy and 10% in F1 score in surgical workflow recognition.
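To make the unsupervised graph-construction step concrete, below is a minimal sketch of spectral clustering over frozen foundation-model patch features, where each resulting cluster becomes a candidate scene-graph node. The specifics are assumptions rather than the paper's implementation: the feature extractor, the cosine-similarity affinity, the normalized-Laplacian embedding with k-means, and all shapes and the cluster count `k` are illustrative.

```python
# Minimal sketch: spectral clustering of per-patch foundation-model features
# into candidate scene-graph nodes. Assumptions (not from the paper): features
# come from a generic frozen vision backbone; affinity is clipped cosine
# similarity; clusters are read off with k-means on the spectral embedding.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(feats: np.ndarray, k: int = 6) -> np.ndarray:
    """Cluster (N, D) patch embeddings of one frame into k graph nodes."""
    # Clipped cosine-similarity affinity between patches.
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    W = np.clip(f @ f.T, 0.0, None)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-8))
    L = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Spectral embedding from the eigenvectors of the k smallest eigenvalues.
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, :k]
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8
    # k-means on the embedding yields one cluster id (node) per patch.
    return KMeans(n_clusters=k, n_init=10).fit_predict(emb)

# Example: 256 patches with 768-d features -> 6 candidate scene-graph nodes.
labels = spectral_clusters(np.random.randn(256, 768), k=6)
```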