Data prefetching--loading data into the cache before it is requested--is essential for reducing I/O overhead and improving database performance. While traditional prefetchers focus on sequential patterns, recent learning-based approaches, especially those leveraging data semantics, achieve higher accuracy for complex access patterns. However, these methods often struggle with today's dynamic, ever-growing datasets and require frequent, timely fine-tuning. Privacy constraints may also restrict access to complete datasets, necessitating prefetchers that can learn effectively from samples. To address these challenges, we present GrASP, a learning-based prefetcher designed for both analytical and transactional workloads. GrASP enhances prefetching accuracy and scalability by leveraging logical block address deltas and combining query representations with result encodings. It frames prefetching as a context-aware multi-label classification task, using multi-layer LSTMs to predict delta patterns from embedded context. This delta modeling approach enables GrASP to generalize predictions from small samples to larger, dynamic datasets without requiring extensive retraining. Experiments on real-world datasets and industrial benchmarks demonstrate that GrASP generalizes to datasets 250 times larger than the training data, achieving up to 45% higher hit ratios, 60% lower I/O time, and 55% lower end-to-end query execution latency than existing baselines. On average, GrASP attains a 91.4% hit ratio, a 90.8% I/O time reduction, and a 57.1% execution latency reduction.
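The delta-based, multi-label framing described above can be sketched in a few lines. This is an illustrative toy, not GrASP's actual pipeline: the function names, the top-k delta vocabulary, and the trace are our own assumptions. Consecutive logical block addresses are converted to deltas, the most frequent deltas form the label space, and each prediction target is a multi-hot vector indicating which deltas occur next (the quantity an LSTM classifier would be trained to predict).

```python
from collections import Counter

def lba_deltas(trace):
    """Convert a sequence of logical block addresses to successive deltas."""
    return [b - a for a, b in zip(trace, trace[1:])]

def build_delta_vocab(deltas, top_k=4):
    """Keep the top-k most frequent deltas as the multi-label space."""
    return [d for d, _ in Counter(deltas).most_common(top_k)]

def multi_hot(next_deltas, vocab):
    """Multi-label target: which vocabulary deltas appear in the next window."""
    seen = set(next_deltas)
    return [1 if d in seen else 0 for d in vocab]

# Toy trace: mostly sequential accesses (+1) with an occasional +8 stride.
trace = [0, 1, 2, 10, 11, 12, 20, 21, 22, 30]
deltas = lba_deltas(trace)         # [1, 1, 8, 1, 1, 8, 1, 1, 8]
vocab = build_delta_vocab(deltas)  # [1, 8]
label = multi_hot(deltas[-3:], vocab)  # both deltas occur: [1, 1]
```

Because the label space is built over deltas rather than absolute addresses, the same vocabulary remains meaningful when the dataset grows, which is the property the abstract attributes to delta modeling.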