This paper studies the impact of DRAM writes on DDR5-based systems. To perform DRAM writes efficiently, modern systems buffer write requests and drain multiple writes each time the DRAM system switches from read mode to write mode. While the DRAM system is performing writes, it cannot service read requests, which increases read latency and reduces performance. We observe that, given the presence of on-die ECC in DDR5 devices, the time to perform a write operation varies significantly: from 1x (for writes to banks in different bankgroups) to 6x (for writes to banks within the same bankgroup) to 24x (for conflicting requests to the same bank). If we can orchestrate the write stream to favor write requests that incur lower latency, then we can reduce the stall time from DRAM writes and improve performance. However, in current systems, the write stream is dictated by the cache replacement policy, which makes eviction decisions without being aware of the variable latency of DRAM writes. The key insight of our work is to improve performance by modifying the cache replacement policy to increase the bank-level parallelism of DRAM writes. Our paper proposes {\em BARD (Bank-Aware Replacement Decisions)}, which modifies the cache replacement policy to favor dirty lines that belong to banks without pending writes. We analyze two variants of BARD: BARD-E (Eviction-based), which changes the eviction policy to evict low-cost dirty lines, and BARD-C (Cleansing-based), which proactively cleans low-cost dirty lines without modifying the eviction decisions. We also develop a hybrid policy (BARD-H), which selectively combines eviction and proactive cleansing. Our evaluations across workloads from SPEC2017, LIGRA, STREAM, and Google server traces show that BARD-H improves performance by 4.3\% on average and by up to 8.5\%. BARD requires only 8 bytes of SRAM per LLC slice.
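To make the bank-aware replacement idea concrete, below is a minimal, self-contained C sketch in the spirit of BARD-E, not the paper's actual implementation: among the oldest ways of a set, it prefers evicting a dirty line whose destination DRAM bank has no pending writes, falling back to plain LRU otherwise. The set layout, the {\tt LRU\_WINDOW} size, the {\tt bank\_of()} address mapping, and the {\tt pending\_write\_mask} bookkeeping are all illustrative assumptions.

\begin{verbatim}
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define WAYS        16        /* associativity of one LLC set (assumed)        */
#define NUM_BANKS   64        /* DRAM banks tracked per LLC slice (assumed)    */
#define LRU_WINDOW  4         /* oldest ways eligible for bank-aware choice    */

typedef struct {
    uint64_t addr;            /* line address (used only for bank mapping)     */
    bool     valid;
    bool     dirty;
    uint8_t  lru_rank;        /* 0 = MRU ... WAYS-1 = LRU                      */
} cache_line_t;

/* One bit per bank: set when the write queue already holds a write to that
 * bank.  A 64-bit mask occupies 8 bytes per slice, but the exact bookkeeping
 * used here is an assumption, not the paper's design.                         */
static uint64_t pending_write_mask;

static bool bank_busy(unsigned bank) {
    return (pending_write_mask >> bank) & 1ull;
}

/* Illustrative address-to-bank mapping (real controllers hash more bits).     */
static unsigned bank_of(uint64_t addr) {
    return (addr >> 6) & (NUM_BANKS - 1);
}

/* Baseline: evict the true LRU way.                                           */
static int lru_victim(const cache_line_t set[WAYS]) {
    for (int w = 0; w < WAYS; w++)
        if (set[w].lru_rank == WAYS - 1)
            return w;
    return 0;
}

/* Bank-aware selection: among the LRU_WINDOW oldest ways, prefer a dirty line
 * whose destination bank has no pending writes (a "low-cost" writeback that
 * adds bank-level parallelism); otherwise fall back to plain LRU.             */
static int bard_victim(const cache_line_t set[WAYS]) {
    for (int w = 0; w < WAYS; w++) {
        const cache_line_t *ln = &set[w];
        if (!ln->valid || ln->lru_rank < WAYS - LRU_WINDOW)
            continue;                     /* outside the eviction window       */
        if (ln->dirty && !bank_busy(bank_of(ln->addr)))
            return w;                     /* favors bank-parallel writes       */
    }
    return lru_victim(set);
}

int main(void) {
    cache_line_t set[WAYS];
    for (int w = 0; w < WAYS; w++) {
        set[w].valid    = true;
        set[w].addr     = (uint64_t)w << 6;   /* way w maps to bank w here     */
        set[w].lru_rank = (uint8_t)w;
        set[w].dirty    = (w % 3 == 0);
    }
    /* Mark the LRU line's bank as already having a queued write.              */
    pending_write_mask |= 1ull << bank_of(set[WAYS - 1].addr);

    printf("plain LRU evicts way %d; bank-aware choice evicts way %d\n",
           lru_victim(set), bard_victim(set));
    return 0;
}
\end{verbatim}

In this example the plain LRU policy would evict the LRU line even though its bank already has a queued write, whereas the bank-aware choice picks a nearby dirty line mapped to an idle bank, spreading writebacks across banks.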