Modern processors implement a decoupled front-end in the form of Fetch Directed Instruction Prefetching (FDIP) to avoid front-end stalls. FDIP is driven by the Branch Prediction Unit (BPU) and relies on the BPU's accuracy and branch target tracking structures to speculatively fetch instructions into the L1 instruction cache (L1I). As data center applications become more complex, their code footprints grow, which increases Branch Target Buffer (BTB) misses. FDIP can alleviate L1I misses, but on a BTB miss the BPU may fail to identify the current instruction as a branch for FDIP. This can prevent FDIP from prefetching, or cause it to speculate down the wrong path and further pollute the L1I. We observe that the vast majority (75%) of these BTB-missing, unidentified branches are actually present in instruction cache lines that FDIP has previously fetched, but they have not yet been decoded and inserted into the BTB. This is because a cache line is decoded only from an entry point (the target of the previous taken branch) to an exit point (the next taken branch). We call the branch instructions that lie in the ignored portion of the cache line "Shadow Branches". Here we present Skeia, a novel shadow branch decoding technique that identifies and decodes the unused bytes in cache lines fetched by FDIP and inserts the branches found there into a Shadow Branch Buffer (SBB). The SBB is accessed in parallel with the BTB, allowing FDIP to keep speculating despite a BTB miss. With only 12.25KB of storage state, Skeia delivers a geomean speedup of ~5.7% over an 8K-entry BTB (78KB), and ~2% over adding an equal amount of state to the BTB, across 16 front-end bound applications. Since many branches stored in the SBB are unique compared to those in a similarly sized BTB, we consistently observe greater performance gains with Skeia across all examined sizes until saturation.
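The entry-point/exit-point idea can be illustrated with a toy model. The sketch below is not Skeia's implementation: it assumes a 64-byte line, assumes instruction boundaries are already known (real variable-length decoding of the bytes before the entry point is harder), and models the BTB and SBB as plain dictionaries. The names `Instr`, `shadow_branches`, and `predict_target` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

LINE_BYTES = 64  # assumed cache line size, not taken from the abstract


@dataclass(frozen=True)
class Instr:
    offset: int                   # byte offset within the cache line
    length: int                   # instruction length in bytes
    target: Optional[int] = None  # branch target; None for non-branches

    @property
    def is_branch(self) -> bool:
        return self.target is not None


def shadow_branches(line: List[Instr], entry: int, exit_end: int) -> List[Instr]:
    """Return branches lying entirely outside the decoded span [entry, exit_end).

    `entry` is where the previous taken branch landed in this line;
    `exit_end` is the first byte after the taken branch that leaves it.
    """
    return [
        ins for ins in line
        if ins.is_branch
        and (ins.offset + ins.length <= entry or ins.offset >= exit_end)
    ]


def predict_target(pc: int, btb: Dict[int, int], sbb: Dict[int, int]) -> Optional[int]:
    """Model a parallel BTB/SBB lookup: a BTB hit wins, else fall back to the SBB."""
    if pc in btb:
        return btb[pc]
    return sbb.get(pc)


# Hypothetical line at address 0x1000; fetch enters at byte 6 and leaves
# via the taken branch at bytes 12..16, so bytes 0..5 and 17..63 are unused.
LINE_BASE = 0x1000
line = [
    Instr(0, 4),
    Instr(4, 2, target=0x2000),   # shadow branch before the entry point
    Instr(6, 6),
    Instr(12, 5, target=0x3000),  # the taken branch (exit point)
    Instr(17, 3),
    Instr(20, 2, target=0x4000),  # shadow branch after the exit point
]

btb = {LINE_BASE + 12: 0x3000}  # only the executed branch is in the BTB
found = shadow_branches(line, entry=6, exit_end=17)
sbb = {LINE_BASE + ins.offset: ins.target for ins in found}

tail_prediction = predict_target(LINE_BASE + 20, btb, sbb)
```

In this toy run the BTB alone would miss on the branch at offset 20, while the SBB supplies its target, which is the situation the abstract describes as letting FDIP speculate despite a BTB miss.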