Sparse Convolution (SpC) powers 3D point cloud networks that are widely used in autonomous driving and augmented/virtual reality. SpC builds a kernel map that stores the mappings between input voxel coordinates, output coordinates, and weight offsets, and then uses this map to compute the feature vectors of the output coordinates. Our work identifies three key properties of voxel coordinates: they are integer-valued, bounded within a limited spatial range, and geometrically continuous, i.e., neighboring voxels on the same object surface are highly likely to lie at small spatial offsets from each other. Prior SpC engines do not fully exploit these properties and incur high pre-processing and post-processing overheads during kernel map construction. To address this, we design Spira, the first voxel-property-aware SpC engine for GPUs. Spira introduces (i) a high-performance one-shot search algorithm that builds the kernel map with no pre-processing and with high data locality, (ii) an effective packed-native processing scheme that accesses packed voxel coordinates at low cost, (iii) a flexible dual-dataflow execution mechanism that efficiently computes output feature vectors by adapting to layer characteristics, and (iv) a network-wide parallelization strategy that builds the kernel maps of all SpC layers concurrently at the start of network execution. Our evaluation shows that Spira outperforms prior state-of-the-art SpC engines by 1.68x on average and up to 3.04x for end-to-end inference, and by 2.11x on average and up to 3.44x for layer-wise execution across diverse layer configurations. The source code of Spira is freely available at https://github.com/SPIN-Research-Group/Spira.
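To make the notion of a kernel map concrete, the following is a minimal CPU sketch in NumPy. It is not Spira's one-shot search algorithm or its GPU implementation; it is a generic sorted-search baseline that only illustrates what a kernel map contains and how integer-valued, bounded coordinates can be packed into a single key. The names `pack`, `build_kernel_map`, `apply_kernel_map`, and the constant `BITS` are hypothetical, as are the 21-bit-per-axis layout and the convention that an input voxel is matched at `output + offset`.

```python
import numpy as np

BITS = 21  # assumed bits per axis; coordinates must lie in [0, 2**BITS)


def pack(coords):
    """Pack non-negative integer (x, y, z) rows into one int64 key per voxel."""
    c = np.asarray(coords, dtype=np.int64).reshape(-1, 3)
    return (c[:, 0] << (2 * BITS)) | (c[:, 1] << BITS) | c[:, 2]


def build_kernel_map(in_coords, out_coords, kernel_size=3):
    """Return (input_index, output_index, offset_index) triples.

    A triple (i, o, k) means: input voxel i sits at out_coords[o] + offsets[k].
    """
    in_keys = pack(in_coords)
    if in_keys.size == 0:
        return []
    order = np.argsort(in_keys)          # sort packed keys once
    sorted_keys = in_keys[order]
    out = np.asarray(out_coords, dtype=np.int64).reshape(-1, 3)
    r = kernel_size // 2
    offsets = [(dx, dy, dz)
               for dx in range(-r, r + 1)
               for dy in range(-r, r + 1)
               for dz in range(-r, r + 1)]
    triples = []
    for k, off in enumerate(offsets):
        query = out + np.array(off, dtype=np.int64)
        # Discard queries that fall outside the bounded coordinate range.
        valid = np.all((query >= 0) & (query < (1 << BITS)), axis=1)
        q_keys = pack(query[valid])
        # Binary search for each queried key among the sorted input keys.
        pos = np.minimum(np.searchsorted(sorted_keys, q_keys),
                         sorted_keys.size - 1)
        hit = sorted_keys[pos] == q_keys
        out_idx = np.nonzero(valid)[0][hit]
        in_idx = order[pos[hit]]
        triples.extend((int(i), int(o), k) for i, o in zip(in_idx, out_idx))
    return triples


def apply_kernel_map(triples, in_feats, weights, n_out):
    """Accumulate in_feats[i] @ weights[k] into output row o for each triple."""
    out = np.zeros((n_out, weights.shape[2]))
    for i, o, k in triples:
        out[o] += in_feats[i] @ weights[k]
    return out
```

Packing turns a three-component coordinate comparison into a single integer comparison, which is one way the integer-valued and bounded properties can be used. The per-offset binary search shown here is the kind of generic lookup a specialized engine would aim to improve on.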