Minuet: Accelerating 3D Sparse Convolutions on GPUs

Sparse Convolution (SC) is widely used for processing 3D point clouds, which are inherently sparse. Unlike dense convolution, SC preserves the sparsity of the input point cloud by allowing outputs only at specific locations. To compute SC efficiently, prior SC engines first use hash tables to build a kernel map that stores the General Matrix Multiplication (GEMM) operations to be executed (Map step), and then execute these GEMM operations through a Gather-GEMM-Scatter process (GMaS step). In this work, we analyze the shortcomings of prior state-of-the-art SC engines and propose Minuet, a novel memory-efficient SC engine tailored for modern GPUs. Minuet (i) replaces the hash tables used in the Map step with a novel segmented sorting double-traversed binary search algorithm that makes heavy use of the on-chip memory hierarchy of GPUs, (ii) uses a lightweight scheme to autotune the tile size in the Gather and Scatter operations of the GMaS step, so that execution adapts to the particular characteristics of each SC layer, dataset, and GPU architecture, and (iii) employs a padding-efficient GEMM grouping approach that reduces both memory padding and kernel launching overheads. Our evaluations show that Minuet outperforms prior SC engines by $1.74\times$ on average (up to $2.22\times$) for end-to-end point cloud network executions. In the Map step, our segmented sorting double-traversed binary search algorithm achieves speedups of $15.8\times$ on average (up to $26.8\times$) over prior SC engines. The source code of Minuet is publicly available at https://github.com/UofT-EcoSystem/Minuet.
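To make the two steps named in the abstract concrete, the following is a minimal CPU sketch in plain Python. It is not Minuet's implementation: the Map step here uses an ordinary sort followed by a standard binary search in place of a hash table, not the segmented sorting double-traversed variant, and the GMaS step runs sequentially with no tiling, autotuning, or GEMM grouping. The function names `build_kernel_map` and `gather_gemm_scatter`, the tuple-based coordinate layout, and the toy data are all assumptions made for illustration.

```python
from bisect import bisect_left


def build_kernel_map(in_coords, out_coords, offsets):
    """Map step (illustrative): for each kernel offset d, list the
    (input_index, output_index) pairs where input == output + d."""
    # Sort the input coordinates once, remembering original indices.
    order = sorted(range(len(in_coords)), key=lambda i: in_coords[i])
    sorted_coords = [in_coords[i] for i in order]

    kernel_map = {}
    for d in offsets:
        pairs = []
        for j, q in enumerate(out_coords):
            query = (q[0] + d[0], q[1] + d[1], q[2] + d[2])
            # Binary search replaces the hash-table lookup.
            pos = bisect_left(sorted_coords, query)
            if pos < len(sorted_coords) and sorted_coords[pos] == query:
                pairs.append((order[pos], j))
        kernel_map[d] = pairs
    return kernel_map


def gather_gemm_scatter(features, weights, kernel_map, num_outputs):
    """GMaS step (illustrative): one small GEMM per kernel offset."""
    c_out = len(next(iter(weights.values()))[0])
    out = [[0.0] * c_out for _ in range(num_outputs)]
    for d, pairs in kernel_map.items():
        if not pairs:
            continue
        w = weights[d]  # c_in x c_out weight matrix for this offset
        # Gather: collect the input feature rows this offset needs.
        gathered = [features[i] for i, _ in pairs]
        # GEMM: multiply the gathered rows by the offset's weights.
        product = [
            [sum(row[k] * w[k][c] for k in range(len(row))) for c in range(c_out)]
            for row in gathered
        ]
        # Scatter: accumulate each result row into its output location.
        for (_, j), row in zip(pairs, product):
            for c in range(c_out):
                out[j][c] += row[c]
    return out


# Toy example: three points, one feature channel, a 3-tap kernel along x.
coords = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
offsets = [(-1, 0, 0), (0, 0, 0), (1, 0, 0)]
features = [[1.0], [2.0], [3.0]]
weights = {(-1, 0, 0): [[10.0]], (0, 0, 0): [[1.0]], (1, 0, 0): [[100.0]]}

# Outputs are placed at the input locations, so sparsity is preserved.
kmap = build_kernel_map(coords, coords, offsets)
result = gather_gemm_scatter(features, weights, kmap, len(coords))
print(kmap)
print(result)
```

In this sketch the kernel map is the only structure shared between the two steps, which is why the cost of building it (the Map step) can be studied and optimized separately from the Gather, GEMM, and Scatter work.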


