Self-Supervised Uncalibrated Multi-View Video Anonymization in the Operating Room

Privacy preservation is a prerequisite for using video data in Operating Room (OR) research. Effective anonymization relies on the exhaustive localization of every individual; even a single missed detection necessitates extensive manual correction. However, existing approaches face two critical scalability bottlenecks: (1) they usually require manual annotations for each new clinical site to reach high accuracy; (2) while multi-camera setups have been widely adopted to address single-view ambiguity, camera calibration is typically required whenever cameras are repositioned. To address these problems, we propose a novel self-supervised multi-view video anonymization framework, consisting of whole-body person detection and whole-body pose estimation, that requires neither manual annotation nor camera calibration. Our core strategy is to enhance the single-view detector by "retrieving" false negatives using temporal and multi-view context, and then to conduct self-supervised domain adaptation. We first run an off-the-shelf whole-body person detector in each view with a low score threshold to gather candidate detections. Then, we retrieve the low-score false negatives that are consistent with the high-score detections via tracking and self-supervised uncalibrated multi-view association. These recovered detections serve as pseudo labels to iteratively fine-tune the whole-body detector. Finally, we apply whole-body pose estimation to each detected person, and fine-tune the pose model using its own high-score predictions. Experiments on the 4D-OR dataset of simulated surgeries and on our dataset of real surgeries demonstrate the effectiveness of our approach, which achieves over 97% recall. Moreover, we train a real-time whole-body detector using our pseudo labels, achieving comparable performance and highlighting our method's practical applicability. Code will be available at https://github.com/CAMMA-public/OR_anonymization.

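The retrieval step described in the abstract (keeping low-score candidates only when they agree with high-score detections) can be illustrated with a minimal sketch. This is not the released implementation: it replaces the tracker and the uncalibrated multi-view association with a simple IoU check against high-score boxes in neighboring frames of a single view. The `Detection` type, the function names, and all threshold values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one detector output in one frame."""
    frame: int
    box: Box
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def retrieve_pseudo_labels(
    detections: List[Detection],
    high_thr: float = 0.7,   # assumed value, not taken from the paper
    low_thr: float = 0.1,    # assumed value, not taken from the paper
    iou_thr: float = 0.5,    # assumed value, not taken from the paper
    max_gap: int = 1,        # how many frames apart a supporting box may be
) -> List[Detection]:
    """Return high-score detections plus temporally supported low-score ones.

    A low-score candidate is kept only if it overlaps a high-score detection
    in a nearby (but different) frame. This is a stand-in for tracking-based
    consistency; multi-view association is not modeled here.
    """
    candidates = [d for d in detections if d.score >= low_thr]
    confident = [d for d in candidates if d.score >= high_thr]
    recovered = [
        d
        for d in candidates
        if d.score < high_thr
        and any(
            0 < abs(d.frame - c.frame) <= max_gap
            and iou(d.box, c.box) >= iou_thr
            for c in confident
        )
    ]
    return confident + recovered
```

In this simplified setting, the returned list would play the role of the pseudo labels used to fine-tune the detector; a candidate with no high-score neighbor is discarded rather than trusted.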