Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

Automated classification of standard echocardiographic views is crucial for an efficient clinical workflow, but it faces three main challenges. First, publicly available datasets are scarce and limited in both scale and view coverage. Second, the performance of several modern video-level architectures on echocardiographic view classification remains underexplored. Third, some view categories have highly similar spatial appearances, so single-frame features are insufficient to discriminate between them, while heterogeneous frame quality complicates robust fusion of temporal information. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos and 910,579 frames across 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose the Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The framework uses uncertainty-aware learning to preferentially sample representative video segments during training and applies evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance compared with diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.
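
To make the idea of evidence-based fusion at inference concrete, the following is a minimal sketch and not the STFM implementation. The abstract does not specify the fusion rule, so the sketch assumes a common evidential formulation: each video segment yields non-negative per-class evidence, which is mapped to Dirichlet parameters, and each segment's prediction is weighted by one minus its uncertainty. The function names `segment_opinion` and `fuse_segments` and the example evidence values are hypothetical.

```python
import numpy as np


def segment_opinion(evidence):
    """Map non-negative per-class evidence to (expected probabilities, uncertainty).

    Assumed formulation (not taken from the paper): alpha = evidence + 1,
    S = sum(alpha), expected probabilities = alpha / S, uncertainty = K / S.
    """
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0                   # Dirichlet parameters
    strength = alpha.sum()                   # Dirichlet strength S
    probs = alpha / strength                 # expected class probabilities
    uncertainty = evidence.size / strength   # lies in (0, 1]; 1 means no evidence
    return probs, uncertainty


def fuse_segments(evidences):
    """Fuse segment predictions with weights proportional to (1 - uncertainty)."""
    opinions = [segment_opinion(e) for e in evidences]
    probs = np.stack([p for p, _ in opinions])
    weights = 1.0 - np.array([u for _, u in opinions])
    if weights.sum() == 0.0:
        # No segment carries any evidence: fall back to a plain average.
        weights = np.ones_like(weights)
    weights = weights / weights.sum()
    return weights @ probs


# Hypothetical example with 3 classes: one informative segment and one
# segment with no evidence (e.g. a low-quality clip).
fused = fuse_segments([[18.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
print(fused)
```

Under this assumed rule, a segment that produces no evidence has uncertainty 1 and therefore receives zero weight, so low-quality segments contribute little to the video-level prediction.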
