Attention-based neural networks (NNs) have proven effective at accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overhead of these models results in high inference latency, which limits their feasibility as practical prefetchers. To close this gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate, attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of this approach, we develop DART, a prefetcher composed of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model. DART accelerates inference by 170x relative to the large model and by 9.4x relative to the distilled model. DART has latency and storage costs comparable to those of the state-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC improvement. DART also outperforms the state-of-the-art NN-based prefetchers TransFetch and Voyager by 33.1% and 37.2%, respectively, in IPC improvement, primarily due to its low prefetching latency.
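
To make the idea of replacing a matrix multiplication with table lookups concrete, the following is a minimal sketch, not DART's actual kernel. It assumes a product-quantization-style scheme: the input vector is split into sub-vectors, each sub-vector is mapped to its nearest prototype, and the dot products between prototypes and weights are precomputed into tables. The abstract does not specify the encoding, so this scheme, the single-layer setup, and all names (`build_tables`, `encode`, `lookup_matvec`) are assumptions made for illustration.

```python
# Illustrative sketch only: product-quantization-style table lookup.
# The scheme and every name below are assumptions, not taken from DART.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tables(weights, prototypes):
    """Precompute dot products between every prototype and the weights.

    weights:    d_in rows, each with d_out columns.
    prototypes: prototypes[c][k] is the k-th prototype of subspace c,
                a vector of length d_in / num_subspaces.
    Returns tables[c][k][j], the dot product of prototype (c, k) with
    the rows of weight column j that belong to subspace c.
    """
    sub_len = len(weights) // len(prototypes)
    d_out = len(weights[0])
    tables = []
    for c, protos in enumerate(prototypes):
        rows = weights[c * sub_len:(c + 1) * sub_len]
        cols = [[row[j] for row in rows] for j in range(d_out)]
        tables.append([[dot(p, col) for col in cols] for p in protos])
    return tables

def encode(x, prototypes):
    """Map each sub-vector of x to the index of its nearest prototype.

    This naive search still costs arithmetic; it is only here to keep
    the sketch short.
    """
    sub_len = len(x) // len(prototypes)
    codes = []
    for c, protos in enumerate(prototypes):
        sub = x[c * sub_len:(c + 1) * sub_len]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, p)) for p in protos]
        codes.append(dists.index(min(dists)))
    return codes

def lookup_matvec(x, prototypes, tables):
    """Approximate x @ weights by summing one table row per subspace."""
    d_out = len(tables[0][0])
    out = [0.0] * d_out
    for c, k in enumerate(encode(x, prototypes)):
        for j in range(d_out):
            out[j] += tables[c][k][j]
    return out

# Toy example: a 4x2 weight matrix, 2 subspaces, 2 prototypes each.
weights = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
prototypes = [[[0.0, 0.0], [1.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
tables = build_tables(weights, prototypes)
x = [0.9, 1.1, 0.1, 0.8]
print(lookup_matvec(x, prototypes, tables))  # prints [11.0, 14.0]
```

In this toy case the exact product is roughly [10.3, 13.2], so the lookup result [11.0, 14.0] is an approximation whose error depends on how well the prototypes cover the inputs. After encoding, the output needs only table reads and additions, which is the source of the arithmetic savings in schemes of this kind.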