UNesT: Local Spatial Representation Learning with Hierarchical Transformer for Efficient Medical Segmentation

Xin Yu, Qi Yang, Yinchi Zhou, Leon Y. Cai, Riqiang Gao, Ho Hin Lee, Thomas Li, Shunxing Bao, Zhoubing Xu, Thomas A. Lasko, Richard G. Abramson, Zizhao Zhang, Yuankai Huo, Bennett A. Landman, Yucheng Tang

From arXiv; 19 pages, 17 figures. arXiv admin note: text overlap with arXiv:2203.02430

Transformer-based models, capable of learning better global dependencies, have recently demonstrated exceptional representation learning capabilities in computer vision and medical image analysis. A transformer reformats the image into separate patches and realizes global communication via the self-attention mechanism. However, positional information between patches is hard to preserve in such 1D sequences, and its loss can lead to sub-optimal performance when dealing with large numbers of heterogeneous tissues of various sizes in 3D medical image segmentation. Additionally, current methods lack the robustness and efficiency needed for heavy-duty medical segmentation tasks such as predicting a large number of tissue classes or modeling globally inter-connected tissue structures. To address these challenges, and inspired by the nested hierarchical structures in vision transformers, we propose a novel 3D medical image segmentation method (UNesT) that employs a simplified and faster-converging transformer encoder design, which achieves local communication among spatially adjacent patch sequences by aggregating them hierarchically. We extensively validate our method on multiple challenging datasets covering multiple modalities, anatomies, and a wide range of tissue classes, including 133 structures in the brain, 14 organs in the abdomen, 4 hierarchical components in the kidneys, and inter-connected kidney tumors and brain tumors. We show that UNesT consistently achieves state-of-the-art performance, and we evaluate its generalizability and data efficiency. In particular, the model performs whole-brain segmentation over the complete ROI with 133 tissue classes in a single network, outperforming the prior state-of-the-art method SLANT27, which ensembles 27 networks.
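The idea of aggregating spatially adjacent patch sequences hierarchically can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`patchify_3d`, `blockify`, `aggregate`, `hierarchy_block_counts`), the patch, block, and volume sizes, and the use of mean pooling as the aggregation step are all assumptions made for illustration, and the self-attention that would run inside each block is omitted.

```python
import numpy as np

def patchify_3d(volume, patch):
    """Split a (D, H, W) volume into non-overlapping cubic patches.

    Returns a grid of flattened patches with shape
    (D // patch, H // patch, W // patch, patch ** 3).
    """
    d, h, w = volume.shape
    assert d % patch == 0 and h % patch == 0 and w % patch == 0
    gd, gh, gw = d // patch, h // patch, w // patch
    x = volume.reshape(gd, patch, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)  # grid axes first, in-patch axes last
    return x.reshape(gd, gh, gw, patch ** 3)

def blockify(grid, block):
    """Group a (Gd, Gh, Gw, C) patch grid into blocks of spatially adjacent patches.

    Returns an array of shape (num_blocks, block ** 3, C). In a hierarchical
    transformer, attention would be computed independently within each block.
    """
    gd, gh, gw, c = grid.shape
    assert gd % block == 0 and gh % block == 0 and gw % block == 0
    x = grid.reshape(gd // block, block, gh // block, block, gw // block, block, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, block ** 3, c)

def aggregate(grid, factor=2):
    """Merge each factor**3 neighbourhood of the patch grid into one token.

    Mean pooling is an assumed stand-in for the aggregation operator.
    """
    gd, gh, gw, c = grid.shape
    assert gd % factor == 0 and gh % factor == 0 and gw % factor == 0
    x = grid.reshape(gd // factor, factor, gh // factor, factor, gw // factor, factor, c)
    return x.mean(axis=(1, 3, 5))

def hierarchy_block_counts(volume, patch=4, block=2, levels=3):
    """Return the number of local blocks at each level of the hierarchy."""
    grid = patchify_3d(volume, patch)
    counts = []
    for level in range(levels):
        blocks = blockify(grid, block)
        counts.append(blocks.shape[0])
        # Local self-attention over `blocks` would happen here (omitted).
        if level < levels - 1:
            grid = aggregate(grid)
    return counts
```

With a hypothetical 32x32x32 volume, 4x4x4 patches, and 2x2x2 blocks, `hierarchy_block_counts(np.zeros((32, 32, 32)))` gives 64, 8, and 1 blocks across three levels: each level communicates only within small neighbourhoods, and the receptive field grows as neighbouring blocks are merged.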
