The automatic classification of 3D medical data is memory-intensive, and the number of slices often varies between samples. Naive solutions such as subsampling can address both problems, but at the cost of potentially discarding diagnostically relevant information. Transformers have shown promising performance in sequential data analysis; however, applying them to long sequences is demanding in terms of data, computation, and memory. In this paper, we propose an end-to-end Transformer-based framework that efficiently classifies volumetric data of variable length. In particular, by randomizing the input slice-wise resolution during training, we enhance the capacity of the learnable positional embedding assigned to each volume slice. Consequently, the positional information accumulated in each positional embedding generalizes to neighbouring slices, even for high-resolution volumes at test time. As a result, the model is more robust to variable volume length and adapts to different computational budgets. We evaluated the proposed approach on retinal OCT volume classification and achieved an average improvement of 21.96% in balanced accuracy on a 9-class diagnostic task compared to state-of-the-art video transformers. Our findings show that varying the slice-wise resolution of the input during training yields a more informative volume representation than training with a fixed number of slices per volume. Our code is available at: https://github.com/marziehoghbaie/VLFAT.
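The idea of randomizing the slice-wise resolution during training can be illustrated with a minimal sketch. It is not taken from the released code: the function names, the uniform sampling of the slice count, the evenly spaced slice selection, and the mapping of slices to rows of a positional-embedding table by relative depth are all assumptions made for illustration.

```python
import random


def sample_slice_count(min_slices, max_slices, rng):
    """Draw how many slices to keep for one training step (assumed uniform)."""
    return rng.randint(min_slices, max_slices)


def evenly_spaced_indices(volume_len, n_slices):
    """Pick n_slices indices spread evenly over a volume of volume_len slices."""
    if n_slices >= volume_len:
        return list(range(volume_len))
    if n_slices == 1:
        return [volume_len // 2]
    step = (volume_len - 1) / (n_slices - 1)
    return [round(i * step) for i in range(n_slices)]


def position_ids(slice_indices, volume_len, table_size):
    """Map each kept slice to a row of a positional-embedding table.

    Hypothetical mapping: a slice's relative depth in the volume selects
    the row, so volumes of different lengths share the same table.
    """
    if volume_len == 1:
        return [0 for _ in slice_indices]
    return [round(i * (table_size - 1) / (volume_len - 1)) for i in slice_indices]


rng = random.Random(0)
n_kept = sample_slice_count(4, 16, rng)        # changes at every training step
kept = evenly_spaced_indices(49, n_kept)       # slices fed to the model
rows = position_ids(kept, 49, table_size=49)   # embedding rows for those slices
```

Under these assumptions, each training step exposes a different subset of embedding rows, so neighbouring rows are trained on nearby anatomical positions; at test time the same mapping can be applied to any number of slices that fits the computational budget.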