Many emerging many-core accelerators replace a single large device memory with hundreds to thousands of lightweight cores, each owning only a small local SRAM and exchanging data via explicit on-chip communication. This organization offers high aggregate bandwidth, but it breaks a key assumption behind many volumetric rendering techniques: that rays can randomly access a large, unified scene representation. Rendering efficiently on such hardware therefore requires distributing both data and computation, keeping ray traversal mostly local, and structuring communication into predictable routes. We present a fully in-SRAM, distributed renderer for the \emph{Radiant Foam} Voronoi-cell volumetric representation on the Graphcore Mk2 IPU, a many-core accelerator with tile-local SRAM and explicit inter-tile communication. Our system shards the scene across tiles and forwards rays between shards through a hierarchical routing overlay, so that ray marching proceeds entirely from on-chip SRAM with predictable communication. On Mip-NeRF~360 scenes, the system attains near-interactive throughput (\(\approx\)1\,fps at \mbox{$640\times480$}) with image and depth quality close to the original GPU-based Radiant Foam implementation, while keeping all scene data and ray state on chip. Beyond demonstrating feasibility, we analyze routing, memory, and scheduling bottlenecks that inform how future distributed-memory accelerators can better support irregular, data-movement-heavy rendering workloads.
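To make the sharding and ray-forwarding scheme concrete, the listing below is a minimal, hypothetical C++ sketch, not the actual IPU implementation: each tile holds a shard of Voronoi cells in local memory, marches its resident rays until they either saturate or cross into a neighbouring shard, and hands exiting rays to a per-neighbour queue that stands in for the hierarchical routing overlay. All names (\texttt{RayState}, \texttt{Shard::march}, \texttt{Router}) and the precomputed cell-to-cell successors are illustrative assumptions rather than the paper's API.

\begin{verbatim}
// Hypothetical sketch of per-tile ray forwarding between scene shards.
// Everything here is illustrative; the real renderer runs on IPU tiles
// with explicit inter-tile exchange instead of in-memory queues.
#include <cmath>
#include <cstdint>
#include <optional>
#include <vector>

struct RayState {
    float transmittance = 1.0f;     // remaining transparency along the ray
    float radiance[3] = {0, 0, 0};  // accumulated colour
    uint32_t pixel = 0;             // destination pixel for the final splat
    uint32_t cell = 0;              // current Voronoi cell id in the shard
};

struct Cell {
    float density;                  // volumetric density of this cell
    float color[3];                 // emitted colour of this cell
    float step;                     // ray segment length inside the cell (toy)
    int32_t exit_shard;             // shard the ray exits into, or -1 if the
                                    // successor cell is still local
    uint32_t next_cell;             // toy precomputed successor along the ray
};

// One tile's slice of the Voronoi-cell scene, held entirely in local SRAM.
struct Shard {
    std::vector<Cell> cells;

    // March a ray through local cells, compositing front to back; return
    // the neighbouring shard id the ray exits into, or std::nullopt if it
    // terminates (saturates) inside this shard.
    std::optional<uint32_t> march(RayState& r) const {
        while (r.transmittance > 0.01f) {
            const Cell& c = cells[r.cell];
            float alpha = 1.0f - std::exp(-c.density * c.step);
            for (int k = 0; k < 3; ++k)
                r.radiance[k] += r.transmittance * alpha * c.color[k];
            r.transmittance *= 1.0f - alpha;
            if (c.exit_shard >= 0)                  // ray leaves this shard
                return static_cast<uint32_t>(c.exit_shard);
            r.cell = c.next_cell;                   // next local cell
        }
        return std::nullopt;                        // ray finished locally
    }
};

// Stand-in for the hierarchical routing overlay: rays bound for other
// shards are batched per neighbour so communication stays predictable.
struct Router {
    std::vector<std::vector<RayState>> outbox;      // one queue per shard
    void forward(uint32_t shard_id, const RayState& r) {
        outbox[shard_id].push_back(r);
    }
};

// One round of work on a tile: march every resident ray, forward the ones
// that crossed a shard boundary, and drop the ones that terminated (their
// radiance would be splatted to the framebuffer, omitted here).
void process_tile(const Shard& shard, std::vector<RayState>& resident,
                  Router& router) {
    for (RayState& r : resident) {
        if (auto next = shard.march(r))
            router.forward(*next, r);
    }
    resident.clear();   // next round's residents arrive via the router
}
\end{verbatim}

In this sketch the per-neighbour outboxes are plain vectors so the control flow stays visible; on the actual hardware they would correspond to batched inter-tile exchanges over the IPU's on-chip fabric.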