In this work, we propose FFDP, a set of IO-aware non-GEMM fused kernels supplemented with a distributed framework for image registration at unprecedented scales. Image registration is an inverse problem fundamental to the biomedical and life sciences, but registration algorithms have not scaled in tandem with image acquisition capabilities. Our framework complements existing model-parallelism techniques proposed for large-scale transformer training by optimizing non-GEMM bottlenecks and enabling convolution-aware tensor sharding. We demonstrate unprecedented capability by performing multimodal registration of a 100-micron ex vivo human brain MRI volume at native resolution, an inverse problem more than 570x larger than a standard clinical datum, in about a minute using only 8 A6000 GPUs. FFDP accelerates existing state-of-the-art optimization-based and deep learning registration pipelines by up to 6-7x while reducing peak memory consumption by 20-59%. Comparative analysis on a 250-micron dataset shows that FFDP fits problems up to 64x larger than the existing state of the art on a single GPU, highlighting both the performance and efficiency gains of FFDP over SOTA image registration methods.
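The abstract names convolution-aware tensor sharding without elaborating the mechanism. Below is a minimal sketch of one standard realization of that idea, assuming a 3D volume sharded along its depth axis across ranks with halo exchange over torch.distributed; the function name halo_conv3d and all specifics are illustrative assumptions, not FFDP's actual API.

```python
# Hypothetical sketch of convolution-aware spatial sharding with halo
# exchange; assumes torch.distributed is already initialized and the
# kernel is cubic with odd size. Not FFDP's actual implementation.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def halo_conv3d(x_shard, weight, bias=None):
    """Apply conv3d to a volume sharded along depth (dim 2) across ranks.

    Each rank holds a contiguous depth slab; slices of width `halo` are
    exchanged with neighboring ranks so that the sharded outputs match
    the unsharded, same-padded convolution.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    halo = weight.shape[2] // 2  # half-width of the (assumed cubic) kernel

    send_prev = x_shard[:, :, :halo].contiguous()   # boundary for rank-1
    send_next = x_shard[:, :, -halo:].contiguous()  # boundary for rank+1
    recv_prev = torch.empty_like(send_prev)         # halo from rank-1
    recv_next = torch.empty_like(send_next)         # halo from rank+1

    ops = []
    if rank > 0:
        ops += [dist.P2POp(dist.isend, send_prev, rank - 1),
                dist.P2POp(dist.irecv, recv_prev, rank - 1)]
    if rank < world - 1:
        ops += [dist.P2POp(dist.isend, send_next, rank + 1),
                dist.P2POp(dist.irecv, recv_next, rank + 1)]
    for req in dist.batch_isend_irecv(ops):
        req.wait()

    parts = ([recv_prev] if rank > 0 else []) + [x_shard] + \
            ([recv_next] if rank < world - 1 else [])
    x_ext = torch.cat(parts, dim=2)

    # Interior ranks already carry real halo voxels in depth, so only the
    # outermost faces of the first and last rank are zero-padded there;
    # height and width are same-padded on every rank.
    pad_d = (halo if rank == 0 else 0, halo if rank == world - 1 else 0)
    x_ext = F.pad(x_ext, (halo, halo, halo, halo, *pad_d))
    return F.conv3d(x_ext, weight, bias)
```

Under these assumptions each rank's output slab has the same depth as its input slab, so concatenating the per-rank results along depth reproduces the single-device convolution without ever materializing the full volume on one GPU.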