Rendering large-scale, unbounded scenes on AR/VR-class devices is constrained by the computation, bandwidth, and storage costs of 3D Gaussian Splatting (3DGS). We propose a low-power, low-cost 3DGS hardware accelerator that renders full-HD images in real time, together with a hardware-friendly compression pipeline that combines iterative Gaussian pruning and fine-tuning, progressive reduction of the spherical harmonics (SH) degree, and vector quantization of all SH coefficients and colors. The compression scheme reduces model size by $51.6\times$ at a PSNR loss of 0.743 dB. The accelerator uses a frame-level pipeline that integrates point-based culling and projection with tile-based sorting and rasterization, skips matrix multiplications involving zero Jacobian entries (reducing processing elements by 63\% and computation by 53\%), and adopts comparison-free tile-based sorting with deterministic latency. Implemented in a TSMC 28-nm process at 800 MHz, the design occupies $0.66~\text{mm}^2$ with 1.1438 M gates and 120 kB of SRAM, consumes 0.219 W, and delivers 1219 Mpixels/J at 267.5 Mpixels/s, enabling 1080p rendering at 129 FPS. Overall, it is $5.98\times$ smaller in area, achieves $5.94\times$ higher throughput, and delivers $7.5\times$ higher energy efficiency than prior 3DGS accelerators.
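As a quick sanity check on the headline figures, the following minimal sketch recomputes the 1080p frame rate and the energy efficiency from the reported throughput and power. The function and constant names are illustrative only and do not come from the paper; a full-HD frame is assumed to be $1920 \times 1080$ pixels.

```python
# Figures quoted in the abstract.
THROUGHPUT_MPIX_PER_S = 267.5   # Mpixels/s
POWER_W = 0.219                 # W
WIDTH, HEIGHT = 1920, 1080      # assumed full-HD frame size


def frames_per_second(throughput_mpix_per_s, width, height):
    """Frame rate sustained at a given pixel throughput."""
    return throughput_mpix_per_s * 1e6 / (width * height)


def energy_efficiency(throughput_mpix_per_s, power_w):
    """Mpixels rendered per joule (Mpixels/s divided by W)."""
    return throughput_mpix_per_s / power_w


fps = frames_per_second(THROUGHPUT_MPIX_PER_S, WIDTH, HEIGHT)
eff = energy_efficiency(THROUGHPUT_MPIX_PER_S, POWER_W)
print(f"{fps:.1f} FPS, {eff:.0f} Mpixels/J")
```

The computed frame rate matches the reported 129 FPS. The computed efficiency (about 1221 Mpixels/J) lies within 0.5 percent of the reported 1219 Mpixels/J; the small gap is presumably due to rounding of the quoted power or throughput.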