GPU workloads with large memory footprints frequently suffer from redundant L2 TLB misses, in which a recently evicted translation is referenced again almost immediately and must be re-walked at full page-walk cost. We characterize these dead-entry misses across 24 GPU workloads and find that they account for up to 99% of L2 TLB misses in the most TLB-sensitive applications, yet their performance impact varies widely with memory access structure. Workloads in which warps share the same virtual page suffer from burst amplification: a single eviction stalls many warps at once, all waiting for the same translation to return. In contrast, workloads in which each warp accesses a distinct set of pages face a capacity-overflow problem that no replacement policy can resolve, a distinction we validate with huge-page experiments. Building on this two-class taxonomy, we design DEPOT (Dead-Entry PrOTection), a 1 KB Bloom filter mechanism that prevents recently evicted translations from being displaced immediately upon reinstallation. DEPOT delivers up to 72% IPC improvement on interference-driven workloads with zero overhead on the others, and it composes with a state-of-the-art TLB prefetching and compaction mechanism for an additional 2% to 7% gain.
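To make the protection idea concrete, the following is a minimal Python sketch of one way a Bloom filter over evicted translations could steer victim selection. It is an illustration under stated assumptions, not the DEPOT design itself. The abstract gives only the 1 KB filter size and the goal of shielding a reinstalled entry from immediate displacement. The class names `EvictionFilter` and `ProtectedTLBSet`, the two multiplicative hash functions, the LRU baseline, and the rule that protection is spent after an entry has been skipped once are all hypothetical. Filter clearing is omitted.

```python
from collections import OrderedDict

FILTER_BITS = 1024 * 8                 # 1 KB, the filter size given in the text
MULTIPLIERS = (0x9E3779B1, 0x85EBCA6B)  # assumption: two multiplicative hashes


class EvictionFilter:
    """Bloom filter over virtual page numbers (VPNs) of evicted entries."""

    def __init__(self, bits=FILTER_BITS):
        self.bits = bits
        self.array = bytearray(bits // 8)

    def _positions(self, vpn):
        # Hypothetical hash family; the real hash functions are not specified.
        return [((vpn * m) >> 16) % self.bits for m in MULTIPLIERS]

    def add(self, vpn):
        for p in self._positions(vpn):
            self.array[p // 8] |= 1 << (p % 8)

    def __contains__(self, vpn):
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(vpn))


class ProtectedTLBSet:
    """One LRU-managed TLB set whose victim selection skips protected entries."""

    def __init__(self, ways, evicted):
        self.ways = ways
        self.evicted = evicted        # shared EvictionFilter
        self.entries = OrderedDict()  # vpn -> protected flag, oldest first

    def access(self, vpn):
        """Return True on a hit, False on a miss (which installs the entry)."""
        if vpn in self.entries:
            self.entries.move_to_end(vpn)
            return True
        if len(self.entries) >= self.ways:
            self._evict()
        # A VPN found in the filter was evicted recently: protect it on refill.
        self.entries[vpn] = vpn in self.evicted
        return False

    def _evict(self):
        victim = None
        for vpn, protected in list(self.entries.items()):  # oldest first
            if not protected:
                victim = vpn
                break
            self.entries[vpn] = False  # assumption: protection is one-shot
        if victim is None:
            victim = next(iter(self.entries))  # all protected: plain LRU
        del self.entries[victim]
        self.evicted.add(victim)
```

In a two-way set, the access sequence 1, 2, 3, 1, 3, 4, 1 ends with a hit on page 1 under this sketch, whereas plain LRU would have evicted page 1 when page 4 was installed. A workload whose per-warp working set simply exceeds TLB capacity would gain nothing from such a policy, which is consistent with the capacity-overflow class described in the text.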