TTP: A Hardware-Efficient Design for Precise Prefetching in Ray Tracing

Ray tracing (RT) is a 3D graphics technique that produces highly realistic visuals. It is becoming more prominent and accessible now that GPU vendors have integrated dedicated ray tracing acceleration hardware. However, tracing millions of rays in real time through 3D scenes composed of large numbers of triangles is challenging and requires expensive hardware. The main bottleneck in RT workloads is traversal of the Bounding Volume Hierarchy (BVH), a large tree structure that encodes the 3D scene. BVH traversal is memory-bound, as the GPU threads spend most of their time reading tree node data from memory. In this work, we attack the memory latency bottleneck of ray tracing through prefetching. We propose a novel hardware prefetcher for ray tracing, named Tree Traversal Prefetcher (TTP). The main idea is to leverage the existing tree traversal stack in the RT units for highly accurate prefetching. In particular, TTP prefetches nodes using the addresses already available on the hardware traversal stack of each thread. For depth-first search (DFS) based traversal, prefetches are generated when nodes are popped consecutively from the traversal stack, which potentially corresponds to upward traversal through the tree. We evaluate TTP on a cycle-level simulator, Vulkan-sim 2.0, and show that it achieves a 1.48x speedup on average (up to 1.89x) over the baseline, with nearly negligible hardware overhead. TTP achieves 98.92% average L1 accuracy, defined as the fraction of prefetched blocks that are actually referenced by demand loads. The coverage, computed as the ratio of L1 miss reduction to baseline L1 misses, is 31.54%, which correlates well with the achieved speedup.
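
The stack-driven trigger and the two metrics described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the TTP hardware design: the `Node` class, `toy_tree`, `simulate`, the `depth` parameter, and the rule "a pop that follows a pop with no intervening push prefetches the top `depth` remaining stack entries" are all hypothetical choices made for illustration. The cache is an unbounded set with instant fills, so timeliness, eviction, ray-box tests, and early ray termination are ignored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    addr: int  # toy address; assume one BVH node per cache block
    children: List["Node"] = field(default_factory=list)


def toy_tree() -> Node:
    # root(0) -> A(1), B(2), C(3); A(1) -> leaves 4 and 5
    a = Node(1, [Node(4), Node(5)])
    return Node(0, [a, Node(2), Node(3)])


def simulate(root: Node, prefetch: bool, depth: int = 1) -> dict:
    """Stack-based DFS over the tree with an optional stack-driven prefetcher."""
    cache = set()    # toy L1: unbounded, no eviction, zero-latency fills
    pending = set()  # prefetched blocks not yet touched by a demand load
    stats = {"order": [], "misses": 0, "issued": 0, "useful": 0}
    stack = [root]
    pushed_last = True  # the initial root push counts as a push

    while stack:
        node = stack.pop()
        stats["order"].append(node.addr)

        # Demand load for the popped node.
        if node.addr in cache:
            if node.addr in pending:
                pending.discard(node.addr)
                stats["useful"] += 1
        else:
            stats["misses"] += 1
            cache.add(node.addr)

        # Assumed trigger: two pops in a row (no push in between) suggest the
        # traversal is moving back up, so entries still on the stack are next.
        if prefetch and not pushed_last:
            for entry in stack[max(0, len(stack) - depth):]:
                if entry.addr not in cache:
                    cache.add(entry.addr)
                    pending.add(entry.addr)
                    stats["issued"] += 1

        # Push children in reverse so the leftmost child is visited first.
        for child in reversed(node.children):
            stack.append(child)
        pushed_last = bool(node.children)

    return stats


def accuracy(stats: dict) -> float:
    """Fraction of prefetched blocks later referenced by a demand load."""
    return stats["useful"] / stats["issued"] if stats["issued"] else 0.0


def coverage(baseline: dict, with_prefetch: dict) -> float:
    """Miss reduction relative to baseline misses."""
    return (baseline["misses"] - with_prefetch["misses"]) / baseline["misses"]


baseline = simulate(toy_tree(), prefetch=False)
ttp_like = simulate(toy_tree(), prefetch=True)
print("accuracy:", accuracy(ttp_like))
print("coverage:", coverage(baseline, ttp_like))
```

In this toy tree every prefetched address is already on the traversal stack, so each one is eventually demanded. This mirrors the intuition for why a stack-driven prefetcher can be highly accurate while its coverage remains bounded by how often consecutive pops occur. The numbers the sketch prints are properties of the toy model only and say nothing about the reported results.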
