Radiant Foam Rendering on a Graph Processor

Emerging many-core accelerators replace a single large device memory with hundreds to thousands of lightweight cores, each owning only a small local SRAM and exchanging data through explicit on-chip communication. This organization offers high aggregate bandwidth, but it breaks a key assumption behind many volumetric rendering techniques: that rays can randomly access a large, unified scene representation. Rendering efficiently on such hardware therefore requires distributing both data and computation, keeping ray traversal mostly local, and structuring communication into predictable routes. We present a fully in-SRAM, distributed renderer for the Radiant Foam Voronoi-cell volumetric representation on the Graphcore Mk2 IPU (Intelligence Processing Unit), a many-core accelerator with tile-local SRAM and explicit inter-tile communication. Our system shards the scene across tiles and forwards rays between shards through a hierarchical routing overlay, so that ray marching runs entirely from on-chip SRAM with predictable communication. On Mip-NeRF 360 scenes, the system attains near-interactive throughput of approximately 1 fps at 640×480, with image and depth-map quality close to the original GPU-based Radiant Foam implementation, while keeping all scene data and ray state in on-chip SRAM. Beyond demonstrating feasibility, we analyze routing, memory, and scheduling bottlenecks that inform how future distributed-memory accelerators can better support irregular, data-movement-heavy rendering workloads.
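To make the idea of a hierarchical routing overlay concrete, the following is a minimal illustrative sketch and not the system's actual scheme, which the abstract does not specify. It assumes tiles are numbered consecutively and grouped into a tree with a fixed fan-out; the function name `route`, the `fanout` parameter, and the `(level, index)` node labels are all hypothetical. A ray leaving one shard climbs to the lowest common ancestor of the source and destination tiles and then descends, so the hop count is bounded by twice the tree depth.

```python
def route(src, dst, fanout=4):
    """Return the overlay path from tile `src` to tile `dst`.

    Nodes are labeled (level, index): level 0 nodes are tiles, and the
    level-l ancestor of tile t is t // fanout**l. This tree layout is an
    assumption made for illustration only.
    """
    if src == dst:
        return [(0, src)]

    up, down = [], []
    level, a, b = 0, src, dst
    # Climb both endpoints one level at a time until they share an ancestor.
    while a != b:
        level += 1
        a //= fanout
        b //= fanout
        up.append((level, a))
        down.append((level, b))

    # The common ancestor is already the last entry of `up`.
    down.pop()
    return [(0, src)] + up + list(reversed(down)) + [(0, dst)]


if __name__ == "__main__":
    # Tiles 5 and 6 share a level-1 parent, so the path has one relay hop.
    print(route(5, 6))
    # Tiles 0 and 15 only meet at level 2, so the path is longer.
    print(route(0, 15))
```

Because every path follows the same fixed tree, the set of links a ray can traverse is known ahead of time, which is one way communication could be made predictable on hardware with explicit inter-tile exchange.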
