Purpose: Data privacy regulations hinder the creation of generalizable foundation models (FMs) for surgery by preventing multi-institutional data aggregation. This study investigates federated learning (FL) as a privacy-preserving solution to collaboratively train robust surgical FMs. Methods: We introduce Federated EndoViT (FL-EndoViT), a federated framework that validates the Masked Autoencoder (MAE) pretraining strategy in a decentralized surgical setting. To ensure convergence under severe data heterogeneity, the architecture integrates adaptive Sharpness-Aware Minimization (FedSAM). Pretrained on the large-scale Endo700k dataset, FL-EndoViT is evaluated against a centralized baseline on different tasks including scene segmentation, action recognition, and phase recognition. Results: FedSAM is critical for successful pretraining, overcoming the convergence failures of standard federated methods. The resulting FL-EndoViT performs comparably to its centralized counterpart, with significant advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. We also establish that full, end-to-end fine-tuning is necessary for optimal performance. Conclusion: This work validates FL with adaptive optimization as a viable paradigm for creating robust, privacy-preserving surgical FMs. Our findings provide a scalable framework for collaborative Surgical Data Science and underscore the optimizer's critical role in handling data heterogeneity. Future work should explore video-based models to incorporate spatiotemporal dynamics.
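To make the optimization idea in the Methods concrete, the following is a minimal sketch of federated averaging with Sharpness-Aware Minimization (SAM) used for the local client updates. It is an illustration under stated assumptions, not the FL-EndoViT implementation: the quadratic client loss, the function names (`sam_step`, `local_train`, `fedavg_round`, `run`), and all hyperparameters are hypothetical, and the sketch uses plain SAM rather than the adaptive variant named in the abstract.

```python
import math


def grad(w, center):
    """Gradient of the toy client loss 0.5 * ||w - center||^2 (assumed loss)."""
    return [wi - ci for wi, ci in zip(w, center)]


def sam_step(w, center, lr=0.1, rho=0.05):
    """One SAM update: perturb the weights along the normalized gradient,
    then descend using the gradient evaluated at the perturbed point."""
    g = grad(w, center)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12  # avoid division by zero
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad(w_adv, center)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


def local_train(w_global, center, steps=5):
    """Local training on one client, starting from the current global weights."""
    w = list(w_global)
    for _ in range(steps):
        w = sam_step(w, center)
    return w


def fedavg_round(w_global, client_centers, client_sizes):
    """One communication round: local SAM training on every client,
    followed by dataset-size-weighted averaging of the client weights."""
    total = sum(client_sizes)
    updates = [local_train(w_global, c) for c in client_centers]
    return [
        sum(n * u[i] for n, u in zip(client_sizes, updates)) / total
        for i in range(len(w_global))
    ]


def run(rounds=50):
    # Three heterogeneous clients: each has a different optimum and dataset size.
    centers = [[1.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
    sizes = [100, 300, 100]
    w = [5.0, -5.0]
    for _ in range(rounds):
        w = fedavg_round(w, centers, sizes)
    return w


if __name__ == "__main__":
    print(run())
```

Only model weights are exchanged between clients and the server in this sketch; the per-client data (here, the `centers`) never leaves the client. For this toy loss, the global weights should settle close to the size-weighted mean of the client optima, with a small offset on the order of `rho`.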