Many medical ultrasound video recognition tasks involve identifying key anatomical features regardless of when they appear in the video, which suggests that models for such tasks may not benefit from temporal features. Correspondingly, model architectures that exclude temporal features may have better sample efficiency. We propose a novel multi-head attention architecture that incorporates these hypotheses as inductive priors to achieve better sample efficiency on common ultrasound tasks. We compare the performance of our architecture to an efficient 3D CNN video recognition model in two settings: one where we expect temporal features to be unnecessary and one where we expect them to be required. In the former setting, our model outperforms the 3D CNN, especially when we artificially limit the training data. In the latter, the outcome reverses. These results suggest that expressive time-independent models may be more effective than state-of-the-art video recognition models for some common ultrasound tasks in the low-data regime.
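To make the idea of a time-independent model concrete, the following is a minimal sketch of multi-head attention pooling over per-frame features. It is not the architecture proposed here, whose details the abstract does not give. It assumes that each frame has already been encoded independently into a feature vector and that each head has one query vector, which would be learned in practice. The names `attention_pool`, `frames`, and `queries` are hypothetical. Because the pooling is a weighted sum over frames with no positional information, the output does not depend on frame order.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, queries):
    """Pool T per-frame feature vectors into one clip-level vector.

    frames:  list of T feature vectors, each of length d (assumed to come
             from a per-frame image encoder, which is not shown here)
    queries: list of H query vectors, one per head, each of length d
             (random here; they would be learned parameters in practice)
    Returns a vector of length H * d: the concatenated per-head poolings.
    """
    d = len(frames[0])
    pooled = []
    for q in queries:
        # Scaled dot-product score between this head's query and each frame.
        scores = [sum(qi * xi for qi, xi in zip(q, x)) / math.sqrt(d)
                  for x in frames]
        weights = softmax(scores)
        # Weighted sum over frames; no frame index or position is used.
        head = [sum(w * x[j] for w, x in zip(weights, frames))
                for j in range(d)]
        pooled.extend(head)
    return pooled


rng = random.Random(0)
frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(6)]   # T=6, d=4
queries = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)]  # H=2

shuffled = frames[:]
rng.shuffle(shuffled)

out_a = attention_pool(frames, queries)
out_b = attention_pool(shuffled, queries)

# The two outputs agree up to floating-point rounding.
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(out_a, out_b)))
```

Shuffling the frames leaves the pooled vector unchanged up to rounding, which is the inductive prior described above: the model can respond to which anatomical features appear, but not to when they appear. A 3D CNN, by contrast, convolves along the time axis and is sensitive to frame order.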