GEMTrans: A General, Echocardiography-based, Multi-Level Transformer Framework for Cardiovascular Diagnosis

Echocardiography (echo) is an ultrasound imaging modality that is widely used for a range of cardiovascular diagnosis tasks. Echo-based diagnosis suffers from inter-observer variability, which arises from differences in image acquisition and from interpretation that depends on clinical experience. Vision-based machine learning (ML) methods have therefore gained popularity as a secondary layer of verification. For such safety-critical applications, any proposed ML method must offer a degree of explainability along with good accuracy. In addition, such methods must be able to process several echo videos obtained from different heart views, and the interactions among them, to produce predictions for a variety of cardiovascular measurement or interpretation tasks. Prior work either lacks explainability or is limited in scope because it focuses on a single cardiovascular task. To remedy this, we propose a General, Echo-based, Multi-Level Transformer (GEMTrans) framework that provides explainability while enabling multi-video training, in which the interplay among echo image patches in the same frame, among all frames in the same video, and among videos is captured based on a downstream task. We show the flexibility of our framework by considering two critical tasks: ejection fraction (EF) estimation and aortic stenosis (AS) severity detection. Our model achieves mean absolute errors of 4.15 and 4.84 for single- and dual-video EF estimation, respectively, and an accuracy of 96.5% for AS detection, while providing informative task-specific attention maps and prototypical explainability.
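
The three levels of aggregation described in the abstract (patches within a frame, frames within a video, videos within a study) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes patch embeddings are already available, and it replaces full transformer encoders with a single-query attention pooling step at each level. The names `attention_pool`, `encode_study`, `q_patch`, `q_frame`, and `q_video` are hypothetical, and the query vectors are random stand-ins for learned parameters. The point is only to show how one attention-weight vector per level could serve as an attention map.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(tokens, query):
    """Pool d-dimensional tokens into one vector using scaled dot-product
    attention against a single query (a simplified stand-in for a
    transformer encoder, assumed here for illustration only)."""
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens))
              for i in range(d)]
    return pooled, weights


def encode_study(study, q_patch, q_frame, q_video):
    """study[v][f][p] is the embedding of patch p in frame f of video v.
    Returns a study-level vector and the attention weights at each level."""
    video_vecs, frame_maps, patch_maps = [], [], []
    for video in study:
        frame_vecs, maps_for_video = [], []
        for frame in video:
            # Level 1: patches in the same frame -> frame embedding
            vec, w = attention_pool(frame, q_patch)
            frame_vecs.append(vec)
            maps_for_video.append(w)
        # Level 2: frames in the same video -> video embedding
        video_vec, frame_w = attention_pool(frame_vecs, q_frame)
        video_vecs.append(video_vec)
        frame_maps.append(frame_w)
        patch_maps.append(maps_for_video)
    # Level 3: videos in the same study -> study embedding
    study_vec, video_w = attention_pool(video_vecs, q_video)
    maps = {"patch": patch_maps, "frame": frame_maps, "video": video_w}
    return study_vec, maps


# Toy input: 2 videos, 3 frames each, 4 patches per frame, dimension 8.
rng = random.Random(0)
D = 8


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(D)]


study = [[[rand_vec() for _ in range(4)] for _ in range(3)] for _ in range(2)]
q_patch, q_frame, q_video = rand_vec(), rand_vec(), rand_vec()
study_vec, maps = encode_study(study, q_patch, q_frame, q_video)
```

In this sketch, `study_vec` would feed a task head (for example a regressor for EF or a classifier for AS severity), while `maps["patch"]`, `maps["frame"]`, and `maps["video"]` each sum to one over their level and indicate which patches, frames, and videos contributed most to the pooled representation.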
