Medical video generation has transformative potential for enhancing surgical understanding and pathology insights through precise and controllable visual representations. However, current models face limitations in controllability and authenticity. To bridge this gap, we propose SurgSora, a motion-controllable surgical video generation framework that uses a single input frame and user-controllable motion cues. SurgSora consists of three key modules: the Dual Semantic Injector (DSI), which extracts object-relevant RGB and depth features from the input frame and integrates them with segmentation cues to capture detailed spatial features of complex anatomical structures; the Decoupled Flow Mapper (DFM), which fuses optical flow with semantic-RGB-D features at multiple scales to enhance temporal understanding and capture object-level spatial dynamics; and the Trajectory Controller (TC), which allows users to specify motion directions and converts them into sparse optical flow to guide the video generation process. The fused features condition a frozen Stable Diffusion model to produce realistic, temporally coherent surgical videos. Extensive evaluations demonstrate that SurgSora outperforms state-of-the-art methods in controllability and authenticity, showing its potential to advance surgical video generation for medical education, training, and research.
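The data flow described above (segmentation-gated RGB-D features, user trajectories turned into sparse flow, and multi-scale flow-feature fusion) can be sketched in toy form. All function names mirror the module names in the abstract, but the bodies are illustrative stand-ins with hypothetical shapes, not the authors' actual networks:

```python
import numpy as np

H, W, C, T = 32, 32, 8, 4  # toy spatial size, channel count, frame count

def dual_semantic_injector(rgb_feat, depth_feat, seg_mask):
    """DSI stand-in: gate concatenated RGB-D features by the segmentation cue."""
    rgbd = np.concatenate([rgb_feat, depth_feat], axis=-1)   # (H, W, 2C)
    return rgbd * seg_mask[..., None]                        # keep object-relevant regions

def trajectory_controller(points, directions):
    """TC stand-in: turn user click points + directions into sparse flow maps."""
    flow = np.zeros((T, H, W, 2))
    for (y, x), d in zip(points, directions):
        for t in range(T):
            flow[t, y, x] = np.asarray(d) * (t + 1) / T      # displacement grows over time
    return flow

def decoupled_flow_mapper(sem_feat, flow):
    """DFM stand-in: fuse flow with semantic-RGB-D features at two scales."""
    fused = []
    for scale in (1, 2):
        f = flow[:, ::scale, ::scale]                        # downsampled sparse flow
        s = sem_feat[::scale, ::scale]                       # downsampled semantics
        mag = np.linalg.norm(f, axis=-1, keepdims=True)      # per-pixel motion magnitude
        fused.append(s[None] * (1.0 + mag))                  # (T, H/scale, W/scale, 2C)
    return fused

# Toy inputs standing in for encoder outputs on a single frame
rgb = np.random.rand(H, W, C)
depth = np.random.rand(H, W, C)
mask = np.zeros((H, W)); mask[8:24, 8:24] = 1.0             # one segmented object

sem = dual_semantic_injector(rgb, depth, mask)
flow = trajectory_controller(points=[(16, 16)], directions=[(1.0, 0.0)])
conds = decoupled_flow_mapper(sem, flow)
print([c.shape for c in conds])  # multi-scale conditions for the frozen diffusion model
```

In the real framework these fused tensors would be injected as conditioning signals into a frozen Stable Diffusion backbone; the sketch only shows how segmentation, depth, and user-specified motion combine into per-scale condition maps.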