Accelerating 3D Gaussian Splatting using Tensor Cores

3D Gaussian Splatting (3DGS) has become a leading technique for real-time neural rendering and 3D scene reconstruction, but its rendering cost remains too high for many latency-sensitive scenarios. In particular, the rasterization stage in 3DGS dominates end-to-end rendering time, during which the renderer repeatedly evaluates each Gaussian's contribution to each covered pixel, making this stage compute-bound. At the same time, modern GPUs provide high-throughput Tensor Cores for low-precision matrix operations, yet existing 3DGS systems execute rasterization entirely on CUDA cores and leave Tensor Cores idle. We find that 3DGS rendering can be executed in FP16 with negligible quality degradation, suggesting a promising opportunity for Tensor Core acceleration. However, exploiting Tensor Cores for 3DGS is non-trivial because rasterization does not naturally match their execution model. Existing 3DGS rasterization is expressed as irregular per-pixel scalar operations, whereas Tensor Cores require dense, regular, and reuse-rich matrix workloads. Moreover, conventional tile-by-tile execution fails to exploit Gaussian reuse across neighboring tiles, resulting in repeated data loading and thus high data movement overhead. To this end, we present TensorGS, a 3DGS acceleration framework using Tensor Cores. TensorGS tensorizes the dominant rasterization computation into Tensor-Core-compatible matrix operations and introduces cross-tile grouping to improve Gaussian reuse, amortize overhead, and increase Tensor Core utilization. Experimental results show that TensorGS improves end-to-end rendering performance by 1.65$\times$ while preserving image quality.
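
To illustrate what "tensorizing" the per-pixel evaluation can mean, the following minimal sketch rewrites the Gaussian exponent of standard 3DGS rasterization as a dense matrix product. It is not taken from TensorGS, whose actual formulation the abstract does not specify. The sketch assumes the usual 2D conic parameters (a, b, c) and projected mean (mx, my), and expands the quadratic form into six monomials of the pixel coordinates. All function names are hypothetical, and plain Python floats stand in for the FP16 fragments that a Tensor Core kernel would use.

```python
# Illustrative sketch only (assumed formulation, not the TensorGS kernel).
# Scalar 3DGS exponent for pixel (px, py) and a Gaussian with mean (mx, my)
# and conic (a, b, c):
#     power = -0.5 * (a*dx*dx + c*dy*dy) - b*dx*dy,  dx = px - mx, dy = py - my
# Expanding in px, py gives a dot product of 6 pixel monomials with
# 6 per-Gaussian coefficients, so a whole tile becomes one matrix product.

def pixel_features(px, py):
    """Monomials of the pixel coordinates: one row of the pixel matrix."""
    return [px * px, py * py, px * py, px, py, 1.0]

def gaussian_coeffs(mx, my, a, b, c):
    """Coefficients matching pixel_features: one row of the Gaussian matrix."""
    return [
        -0.5 * a,                       # multiplies px^2
        -0.5 * c,                       # multiplies py^2
        -b,                             # multiplies px*py
        a * mx + b * my,                # multiplies px
        c * my + b * mx,                # multiplies py
        -0.5 * a * mx * mx - 0.5 * c * my * my - b * mx * my,  # constant term
    ]

def matmul_transposed(P, G):
    """Return P @ G^T for P (n_pixels x 6) and G (n_gaussians x 6)."""
    return [[sum(p * g for p, g in zip(prow, grow)) for grow in G] for prow in P]

def tile_exponents(pixels, gaussians):
    """Exponents for every (pixel, Gaussian) pair as one dense product."""
    P = [pixel_features(px, py) for px, py in pixels]
    G = [gaussian_coeffs(*g) for g in gaussians]
    return matmul_transposed(P, G)

def exponent_scalar(px, py, mx, my, a, b, c):
    """Reference per-pixel scalar evaluation, as in conventional rasterizers."""
    dx, dy = px - mx, py - my
    return -0.5 * (a * dx * dx + c * dy * dy) - b * dx * dy
```

In this form, the pixel matrix is shared by every Gaussian that covers the tile and the Gaussian matrix is shared by every pixel, which is the kind of dense, reuse-rich workload that matrix units favor. In reduced precision, one would likely use tile-relative pixel coordinates to limit cancellation between the large monomial terms; that choice is an assumption of this sketch rather than something stated in the abstract.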
