Localizing the narrowest position of the vessel in carotid ultrasound (US), and delineating the vessel and remnant vessel at that position, are essential for carotid stenosis grading (CSG) in clinical practice. However, this pipeline is time-consuming and difficult because of ambiguous plaque boundaries and temporal variation. Automating it usually requires a large number of manual delineations, which are laborious to produce and unreliable given the annotation difficulty. In this study, we present the first video classification framework for automatic CSG. Our contribution is three-fold. First, to avoid the need for laborious and unreliable annotation, we propose a novel and effective video classification network for weakly supervised CSG. Second, to ease model training, we adopt an inflation strategy in which pre-trained 2D convolution weights are adapted to their 3D counterparts in our network, so that an existing large pre-trained model can serve as an effective warm start. Third, to enhance the discriminability of video features, we propose a novel attention-guided multi-dimension fusion (AMDF) transformer encoder that models and integrates global dependencies within and across the spatial and temporal dimensions, using two lightweight cross-dimensional attention mechanisms designed for this purpose. Our approach is extensively validated on a large, clinically collected carotid US video dataset and achieves state-of-the-art performance compared with strong competitors.
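The abstract does not specify how the 2D weights are inflated. As a minimal sketch, assuming the common scheme in which each 2D kernel is replicated along a new temporal axis and rescaled by the temporal kernel size, the idea can be illustrated as follows. The function name `inflate_conv2d_weight`, the parameter `time_dim`, and the tensor shapes are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import numpy as np


def inflate_conv2d_weight(w2d: np.ndarray, time_dim: int) -> np.ndarray:
    """Inflate a 2D conv kernel (out, in, kH, kW) to 3D (out, in, kT, kH, kW).

    Assumed scheme (not confirmed by the source): copy the 2D kernel
    time_dim times along a new temporal axis and divide by time_dim.
    """
    if w2d.ndim != 4:
        raise ValueError("expected a 4D weight of shape (out, in, kH, kW)")
    if time_dim < 1:
        raise ValueError("time_dim must be a positive integer")
    # Insert the temporal axis, then replicate the kernel along it.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_dim, axis=2)
    # Rescale so the temporal slices sum back to the original 2D kernel.
    return w3d / time_dim


# Hypothetical pre-trained 2D weight: 8 output channels, 3 input channels, 3x3 kernel.
rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 3, 3, 3))
w3d = inflate_conv2d_weight(w2d, time_dim=5)
```

Under this assumed scheme, summing the inflated kernel over its temporal axis recovers the original 2D kernel, so a clip made of identical frames produces the same response as the 2D layer did on a single frame. That property is what makes the pre-trained weights a reasonable warm start.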