The automatic classification of 3D medical data is memory-intensive, and the number of slices commonly varies between samples. Naive solutions such as subsampling can address these problems, but at the cost of potentially discarding relevant diagnostic information. Transformers have shown promising performance in sequential data analysis. However, applying them to long sequences is demanding in terms of data, computation, and memory. In this paper, we propose an end-to-end Transformer-based framework that classifies volumetric data of variable length efficiently. In particular, by randomizing the volume-wise resolution of the input (the number of slices) during training, we enhance the capacity of the learnable positional embedding assigned to each volume slice. Consequently, the positional information accumulated in each positional embedding generalizes to the neighbouring slices, even for high-resolution volumes at test time. As a result, the model is more robust to variable volume length and adapts to different computational budgets. We evaluated the proposed approach on retinal OCT volume classification and achieved an average improvement of 21.96% in balanced accuracy on a 9-class diagnostic task, compared to state-of-the-art video transformers. Our findings show that varying the volume-wise resolution of the input during training yields a more informative volume representation than training with a fixed number of slices per volume.
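The idea of randomizing the volume-wise resolution during training can be illustrated with a minimal sketch. It is not the implementation evaluated here: the function name `sample_slices`, the evenly spaced sampling scheme, the bounds `min_slices` and `max_slices`, and the mapping from slice depth to a row of the positional table are all assumptions made for illustration, and a plain NumPy array stands in for the learnable embedding table.

```python
import numpy as np


def sample_slices(volume, pos_table, min_slices, max_slices, rng):
    """Keep a random number of evenly spaced slices and look up their embeddings.

    volume:    array of shape (num_slices, H, W)
    pos_table: array of shape (table_len, dim); stands in for a learnable table
    All names and the sampling scheme are hypothetical.
    """
    num_slices = volume.shape[0]
    upper = min(max_slices, num_slices)
    lower = min(min_slices, upper)
    # Random volume-wise resolution for this training step (upper bound inclusive).
    k = int(rng.integers(lower, upper + 1))
    # Evenly spaced slice indices covering the whole volume.
    slice_idx = np.round(np.linspace(0, num_slices - 1, k)).astype(int)
    # Map each kept slice to a table row by its relative depth, so that
    # neighbouring slices share or adjoin the same positional embedding.
    table_len = pos_table.shape[0]
    if num_slices > 1:
        rel_depth = slice_idx / (num_slices - 1)
        pos_idx = np.round(rel_depth * (table_len - 1)).astype(int)
    else:
        pos_idx = np.zeros(k, dtype=int)
    return volume[slice_idx], pos_table[pos_idx], slice_idx, pos_idx
```

Under these assumptions, each training step sees a different subset of slice positions, so every row of the table is updated from a range of nearby depths rather than from one fixed slice. At test time the same mapping can be applied to a volume with more slices than were ever sampled during training.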