Attention-based Neural Networks (NNs) have proven effective at accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overhead of these models results in high inference latency, which limits their feasibility as practical prefetchers. To close this gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our tabularization methodology takes as input a distilled, yet highly accurate, attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of this approach, we develop DART, a prefetcher composed of a simple hierarchy of tables. At the cost of a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model. DART accelerates inference by 170x relative to the large model and by 9.4x relative to the distilled model. DART has latency and storage costs comparable to those of the state-of-the-art rule-based prefetcher BO, yet surpasses it by 6.1% in IPC improvement. DART also outperforms the state-of-the-art NN-based prefetchers TransFetch and Voyager by 33.1% and 37.2%, respectively, in IPC improvement, primarily because of its low prefetching latency.
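To make the idea of replacing a matrix multiplication with table lookups concrete, the following is a minimal sketch of one common way to do it, a product-quantization-style lookup. It is an illustration under stated assumptions, not DART's actual construction: the abstract does not specify the encoding scheme or the table hierarchy, and every name here (`build_tables`, `encode`, `lookup_matvec`, `PROTOTYPES`, `W`) is hypothetical. The input vector is split into sub-vectors, each sub-vector is mapped to its nearest prototype, and the precomputed product of that prototype with the matching block of the weight matrix is fetched and summed.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(x: Vector, w: Matrix) -> Vector:
    """Exact product x @ w, where w has len(x) rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def build_tables(prototypes: List[Matrix], w: Matrix, sub_dim: int) -> List[Matrix]:
    """Offline step: for each subspace c and prototype k, store prototype @ w_block."""
    tables = []
    for c, protos in enumerate(prototypes):
        block = w[c * sub_dim:(c + 1) * sub_dim]  # rows of w for this subspace
        tables.append([matvec(p, block) for p in protos])
    return tables


def encode(x: Vector, prototypes: List[Matrix], sub_dim: int) -> List[int]:
    """Map each sub-vector to the index of its nearest prototype.

    Assumption: nearest-neighbour search by squared Euclidean distance.
    A hardware-friendly design would likely use a cheaper encoding.
    """
    codes = []
    for c, protos in enumerate(prototypes):
        sub = x[c * sub_dim:(c + 1) * sub_dim]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, p)) for p in protos]
        codes.append(dists.index(min(dists)))
    return codes


def lookup_matvec(x: Vector, prototypes: List[Matrix],
                  tables: List[Matrix], sub_dim: int) -> Vector:
    """Approximate x @ w using one table lookup per subspace plus additions."""
    out = [0.0] * len(tables[0][0])
    for c, k in enumerate(encode(x, prototypes, sub_dim)):
        for j, v in enumerate(tables[c][k]):
            out[j] += v
    return out


# Toy data (hypothetical): 4-dimensional input, 2 subspaces, 2 prototypes each.
SUB_DIM = 2
PROTOTYPES = [
    [[0.0, 0.0], [1.0, 1.0]],  # prototypes for x[0:2]
    [[1.0, 0.0], [0.0, 1.0]],  # prototypes for x[2:4]
]
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
TABLES = build_tables(PROTOTYPES, W, SUB_DIM)

if __name__ == "__main__":
    x = [0.9, 1.1, 0.1, 0.8]
    print("exact :", matvec(x, W))
    print("lookup:", lookup_matvec(x, PROTOTYPES, TABLES, SUB_DIM))
```

In this sketch all multiplications with `W` happen offline in `build_tables`; at inference time only the encoding step and a few additions remain. The result is exact when every sub-vector coincides with a prototype and approximate otherwise, which is one way to picture the trade-off between arithmetic cost and the small F1-score drop reported above.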