Diffusion-based video generation models have made significant strides, producing outputs with improved visual fidelity, temporal coherence, and user control. These advancements hold great promise for improving surgical education by enabling more realistic, diverse, and interactive simulation environments. In this study, we introduce SurGen, a text-guided diffusion model tailored for surgical video synthesis. SurGen produces videos with the highest resolution and longest duration among existing surgical video generation models. We validate the visual and temporal quality of the outputs using standard image and video generation metrics. Additionally, we assess their alignment with the corresponding text prompts through a deep learning classifier trained on surgical data. Our results demonstrate the potential of diffusion models to serve as valuable educational tools for surgical trainees.