Efficient instruction fetching is critical to performance. It requires the primary structures, namely the L1 instruction cache (L1i), the branch target buffer (BTB), and the instruction TLB (iTLB), to hold the requisite information when it is needed. This paper proposes a high-level program sequencing mechanism and a coupled block-movement technique, instruction presending, in which instruction cache blocks, BTB entries, and iTLB entries are autonomously moved (or sent) from the secondary to the primary structures in a "just in time" fashion so that they are available when needed. Empirical results demonstrate the efficacy of the high-level sequencing mechanism and of the block movement. Presending is especially effective for benchmarks with a high base misses-per-kilo-instruction (MPKI) rate, where instruction blocks (and BTB/iTLB entries) move frequently from the secondary to the primary structures. In many cases, presending reduces the number of misses in the primary structures by an order of magnitude compared to state-of-the-art instruction prefetching schemes, while allowing the processor to operate with small primary BTBs. This reduction in misses translates into performance improvements in cases where front-end efficiency is important.
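To make the contrast between demand fetching and sending blocks ahead of use concrete, the following is a minimal toy sketch, not the mechanism proposed in the paper. The abstract does not describe how the high-level sequencer works, so the sketch assumes a hypothetical oracle that knows the next `lookahead` blocks in the trace. The names `PrimaryCache`, `count_misses`, and `lookahead`, the LRU policy, the cyclic trace, and the figure of 16 instructions per block are all illustrative assumptions.

```python
from collections import OrderedDict


def mpki(misses, instructions):
    """Misses per thousand (kilo) instructions."""
    return 1000.0 * misses / instructions


class PrimaryCache:
    """Tiny LRU-managed primary structure holding block IDs (illustrative)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()

    def install(self, block):
        # Insert or refresh a block; evict the least recently used one if full.
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)

    def access(self, block):
        # Return True on a hit; on a miss, the block is filled on demand.
        hit = block in self.blocks
        self.install(block)
        return hit


def count_misses(trace, capacity, lookahead=0):
    """Count primary misses; lookahead=0 means demand fetching only."""
    cache = PrimaryCache(capacity)
    misses = 0
    for i, block in enumerate(trace):
        if not cache.access(block):
            misses += 1
        # Assumption: an oracle sequencer pushes the next blocks ahead of use.
        for upcoming in trace[i + 1 : i + 1 + lookahead]:
            cache.install(upcoming)
    return misses


# A loop over 8 blocks that does not fit in a 4-block primary structure.
trace = list(range(8)) * 10
instructions = len(trace) * 16  # assumed 16 instructions per block

demand_misses = count_misses(trace, capacity=4)
presend_misses = count_misses(trace, capacity=4, lookahead=2)

print(f"demand-only: {demand_misses} misses, MPKI {mpki(demand_misses, instructions):.2f}")
print(f"with push:   {presend_misses} misses, MPKI {mpki(presend_misses, instructions):.2f}")
```

In this toy setup the demand-only configuration thrashes because the loop is larger than the primary structure, while the push-ahead configuration misses only on the first, cold access. The numbers say nothing about the paper's reported results. They only illustrate why a small primary structure can suffice when the needed entries arrive just before they are used.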