Echocardiography (ECHO) video is widely used for cardiac examination. In clinical practice, the procedure relies heavily on operator experience, which requires years of training and may benefit from deep learning-based systems for improved accuracy and efficiency. However, acquiring sufficient customized data (e.g., abnormal cases) for novice training and deep model development is clinically unrealistic. Hence, controllable ECHO video synthesis is highly desirable. In this paper, we propose a novel diffusion-based framework named HeartBeat for controllable and high-fidelity ECHO video synthesis. Our contributions are three-fold. First, HeartBeat serves as a unified framework that perceives multimodal conditions simultaneously to guide controllable generation. Second, we factorize the multimodal conditions into local and global ones, and provide two insertion strategies that offer fine- and coarse-grained control, respectively, in a composable and flexible manner. In this way, users can synthesize ECHO videos that match their mental imagery by combining multimodal control signals. Third, we decouple the learning of visual concepts and temporal dynamics via a two-stage training scheme, which simplifies model training. Notably, HeartBeat generalizes readily to mask-guided cardiac MRI synthesis with only a few shots, showcasing its scalability to broader applications. Extensive experiments on two public datasets demonstrate the efficacy of the proposed HeartBeat.
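The local/global condition factorization can be illustrated with a minimal sketch: local conditions (e.g., segmentation masks or sketches) enter by channel concatenation with the noisy latent for fine-grained spatial control, while global conditions (e.g., attribute or view embeddings) enter via cross-attention for coarse-grained control. This is a hypothetical toy implementation under those assumptions, not the paper's actual architecture; all module and tensor names are illustrative.

```python
# Toy denoiser sketch: local conditions via channel concat (fine-grained),
# global conditions via cross-attention (coarse-grained). Illustrative only.
import torch
import torch.nn as nn


class CrossAttnBlock(nn.Module):
    """Injects global condition tokens into latent features via cross-attention."""

    def __init__(self, dim: int, cond_dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                          vdim=cond_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) latent tokens; cond: (B, M, cond_dim) global tokens.
        out, _ = self.attn(self.norm(x), cond, cond)
        return x + out  # residual connection


class ConditionalDenoiser(nn.Module):
    """Combines local (concat) and global (cross-attn) control in one forward pass."""

    def __init__(self, latent_ch: int = 4, local_ch: int = 3,
                 dim: int = 64, cond_dim: int = 32):
        super().__init__()
        # Local conditions are concatenated with the noisy latent at the input.
        self.in_conv = nn.Conv2d(latent_ch + local_ch, dim, 3, padding=1)
        self.cross_attn = CrossAttnBlock(dim, cond_dim)
        self.out_conv = nn.Conv2d(dim, latent_ch, 3, padding=1)

    def forward(self, z_t, local_cond, global_cond):
        # z_t: (B, C, H, W) noisy latent; local_cond: (B, C_l, H, W) spatial map;
        # global_cond: (B, M, cond_dim) global condition tokens.
        h = self.in_conv(torch.cat([z_t, local_cond], dim=1))  # fine-grained control
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)           # (B, H*W, dim)
        tokens = self.cross_attn(tokens, global_cond)   # coarse-grained control
        h = tokens.transpose(1, 2).reshape(b, c, hh, ww)
        return self.out_conv(h)                         # predicted noise


model = ConditionalDenoiser()
z_t = torch.randn(2, 4, 32, 32)    # noisy latent frame
mask = torch.randn(2, 3, 32, 32)   # local condition (e.g., segmentation mask)
attrs = torch.randn(2, 1, 32)      # global condition tokens
eps = model(z_t, mask, attrs)      # -> (2, 4, 32, 32)
```

Consistent with the two-stage scheme described above, a common decoupling recipe is to first train the spatial (per-frame) layers on individual frames to learn visual concepts, then freeze them and train only the temporal layers on video clips to learn dynamics; whether HeartBeat follows this exact recipe is an assumption here.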