Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet it remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset. We define a fine-grained, multi-role action taxonomy and generate dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when explicit temporal modeling is added through Graph Neural Networks. We therefore introduce a vision-only temporal model that significantly outperforms graph-based methods when using all available egocentric video as input. Building on this model, we also introduce a novel multi-view to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.
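As a rough illustration of how dense segments might be derived from frame-wise scene graph state changes, the following minimal sketch groups consecutive frames in which the same relation triplet holds into one segment. It is not the authors' distillation procedure: the function name `segments_from_scene_graphs`, the triplet format, and the example role and predicate names are all assumptions made for this example, and the mapping from relations to the paper's action taxonomy is not modeled.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Assumed (hypothetical) representation: one relation is a
# (subject, predicate, object) triplet; one frame is a set of triplets.
Triplet = Tuple[str, str, str]
# A segment is (triplet, start_frame, end_frame), with end_frame inclusive.
Segment = Tuple[Triplet, int, int]


def segments_from_scene_graphs(frames: Sequence[Set[Triplet]]) -> List[Segment]:
    """Turn per-frame relation sets into temporally extended segments.

    A segment opens at the frame where a triplet first appears and closes
    at the last frame before it disappears (a scene graph state change).
    """
    open_since: Dict[Triplet, int] = {}  # triplet -> frame where it started
    segments: List[Segment] = []

    for t, relations in enumerate(frames):
        # Close every open triplet that is no longer present in this frame.
        for triplet in list(open_since):
            if triplet not in relations:
                segments.append((triplet, open_since.pop(triplet), t - 1))
        # Open a segment for every triplet that newly appears.
        for triplet in relations:
            open_since.setdefault(triplet, t)

    # Close whatever is still open at the end of the sequence.
    last = len(frames) - 1
    for triplet, start in open_since.items():
        segments.append((triplet, start, last))

    # Sort by start frame, then end frame, then triplet for a stable order.
    return sorted(segments, key=lambda s: (s[1], s[2], s[0]))


# Hypothetical toy sequence of five frames.
example_frames = [
    {("nurse", "holding", "scalpel")},
    {("nurse", "holding", "scalpel"), ("surgeon", "cutting", "patient")},
    {("surgeon", "cutting", "patient")},
    set(),
    {("nurse", "holding", "scalpel")},
]
example_segments = segments_from_scene_graphs(example_frames)
```

In this toy sequence, the relation that reappears in the last frame produces a second, separate segment rather than extending the first, which is one simple way to treat a state change as an action boundary.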