Load-dependent branches (LDBs) often exhibit no regular pattern in their local or global history, which makes them inherently hard for conventional branch predictors to predict. We propose a software-to-hardware branch pre-resolution mechanism that lets software pass branch outcomes to the processor frontend before the branch instruction is fetched. A compiler pass identifies the instruction chain leading to the branch (the branch backslice) and generates pre-execute code that produces the branch outcomes before the frontend consumes them. The loop structure makes it possible to map each branch outcome unambiguously to its corresponding dynamic instance of the branch instruction. Our approach also allows covering the loop iteration space selectively, with arbitrarily complex patterns. Our pre-execution method enables important optimizations such as unrolling and vectorization, which substantially reduce the pre-execution overhead. Experimental results on select workloads from SPEC CPU 2017 and graph analytics workloads show up to a 95% reduction in MPKI (21% on average), up to 39% speedup (7% on average), and up to 3x improvement in IPC (23% on average) compared to a core with a TAGE-SC-L-64KB branch predictor.
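To make the idea of a branch backslice and a pre-execute loop concrete, the following is a minimal software-only sketch in C. It is not the mechanism's actual interface: the names `outcome_queue`, `MAX_ITERS`, `pre_execute`, and `sum_pre_resolved` are hypothetical, and the plain array merely stands in for whatever hardware structure would hold the pre-resolved outcomes. In the proposed mechanism the frontend would consume the outcomes; here the main loop reads them explicitly so the iteration-to-outcome mapping is visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ITERS 1024  /* hypothetical capacity of the outcome queue */

/* Software stand-in (assumption) for the structure that would hold
   pre-resolved outcomes: one entry per dynamic branch instance. */
static uint8_t outcome_queue[MAX_ITERS];

/* Original loop: the branch condition depends on data[idx[i]],
   a value that is only known after two dependent loads. */
long sum_original(const int *data, const int *idx, size_t n, int threshold) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (data[idx[i]] > threshold)   /* load-dependent branch */
            sum += data[idx[i]];
    }
    return sum;
}

/* Pre-execute loop: contains only the branch backslice
   (the two loads and the compare). Its body has no data-dependent
   control flow, which is what makes unrolling or vectorization
   of this loop plausible. */
void pre_execute(const int *data, const int *idx, size_t n, int threshold) {
    assert(n <= MAX_ITERS);
    for (size_t i = 0; i < n; i++)
        outcome_queue[i] = (uint8_t)(data[idx[i]] > threshold);
}

/* Main loop: iteration i consumes outcome i, so the loop structure
   maps each outcome to exactly one dynamic instance of the branch. */
long sum_pre_resolved(const int *data, const int *idx, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (outcome_queue[i])           /* outcome produced ahead of time */
            sum += data[idx[i]];
    }
    return sum;
}
```

Calling `pre_execute` and then `sum_pre_resolved` on the same inputs should return the same value as `sum_original`; the sketch only illustrates the code split, not any performance effect.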