Exploring shared memory architectures for end-to-end gigapixel deep learning

Lucas W. Remedios, Leon Y. Cai, Samuel W. Remedios, Karthik Ramadass, Aravind Krishnan, Ruining Deng, Can Cui, Shunxing Bao, Lori A. Coburn, Yuankai Huo, Bennett A. Landman

Deep learning has made great strides in medical imaging, enabled by hardware advances in GPUs. One major constraint on the development of new models has been the saturation of GPU memory during training. This is especially true in computational pathology, where images regularly contain more than 1 billion pixels. Because of hardware limitations, these pathological images are traditionally divided into small patches to enable deep learning. In this work, we explore whether the shared GPU/CPU memory architecture of the M1 Ultra systems-on-a-chip (SoCs) recently released by Apple, Inc. may provide a solution. These affordable systems (less than \$5000) provide access to 128 GB of unified memory (Mac Studio with M1 Ultra SoC). As a proof of concept for gigapixel deep learning, we segmented tissue from background on gigapixel areas of whole slide images (WSIs). The model was a modified U-Net (4492 parameters) leveraging large kernels and high stride. The M1 Ultra SoC was able to train the model directly on gigapixel images (16000$\times$64000 pixels, 1.024 billion pixels) with a batch size of 1, using over 100 GB of unified memory for the process, at an average speed of 1 minute and 21 seconds per batch with TensorFlow 2/Keras. As expected, the model converged with a high Dice score of 0.989 $\pm$ 0.005. Training to this point took 111 hours and 24 minutes over 4940 steps. Other high-memory GPUs like the NVIDIA A100 (largest commercially accessible at 80 GB, $\sim$\$15000) are not yet widely available (in preview for select regions on Amazon Web Services at \$40.96/hour as a group of 8). This study is a promising step towards WSI-wise end-to-end deep learning with prevalent network architectures.
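
The figures quoted above can be sanity-checked with a short calculation. The sketch below is illustrative only: the image size, step count, and total training time come from the abstract, while the helper `tensor_gb`, the float32 storage, and the channel counts are assumptions chosen to show why full-resolution tensors at this scale can exhaust conventional GPU memory. It does not describe the actual memory layout of the model.

```python
# Back-of-the-envelope check of the numbers quoted in the abstract.
# Assumptions (not from the paper): float32 storage, illustrative channel counts.

HEIGHT, WIDTH = 16_000, 64_000           # image size from the abstract
pixels = HEIGHT * WIDTH                  # total pixels per training image


def tensor_gb(n_pixels: int, channels: int, bytes_per_value: int = 4) -> float:
    """Size in decimal GB of a dense tensor with one value per pixel per channel."""
    return n_pixels * channels * bytes_per_value / 1e9


rgb_input_gb = tensor_gb(pixels, channels=3)     # hypothetical float32 RGB input
feature_map_gb = tensor_gb(pixels, channels=8)   # hypothetical 8-channel full-resolution map

# Training time reported in the abstract: 111 h 24 min over 4940 steps.
total_seconds = 111 * 3600 + 24 * 60
seconds_per_step = total_seconds / 4940

print(f"pixels per image:        {pixels:,}")
print(f"float32 RGB input:       {rgb_input_gb:.3f} GB")
print(f"8-channel float32 map:   {feature_map_gb:.3f} GB")
print(f"average time per step:   {seconds_per_step:.2f} s")
```

Under these assumptions, a single full-resolution float32 RGB image already occupies about 12.3 GB, and the average step time works out to roughly 81 seconds, which agrees with the reported 1 minute and 21 seconds per batch.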
