Data prefetching--loading data into the cache before it is requested--is essential for reducing I/O overhead and improving database performance. While traditional prefetchers focus on sequential patterns, recent learning-based approaches, especially those leveraging data semantics, achieve higher accuracy for complex access patterns. However, these methods often struggle with today's dynamic, ever-growing datasets and require frequent, timely fine-tuning. Privacy constraints may also restrict access to complete datasets, necessitating prefetchers that can learn effectively from samples. To address these challenges, we present GrASP, a learning-based prefetcher designed for both analytical and transactional workloads. GrASP enhances prefetching accuracy and scalability by leveraging logical block address deltas and combining query representations with result encodings. It frames prefetching as a context-aware multi-label classification task, using multi-layer LSTMs to predict delta patterns from embedded context. This delta modeling approach enables GrASP to generalize predictions from small samples to larger, dynamic datasets without requiring extensive retraining. Experiments on real-world datasets and industrial benchmarks demonstrate that GrASP generalizes to datasets 250 times larger than the training data, achieving up to 45% higher hit ratios, 60% lower I/O time, and 55% lower end-to-end query execution latency than existing baselines. On average, GrASP attains a 91.4% hit ratio, a 90.8% I/O time reduction, and a 57.1% execution latency reduction.
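The delta-based, multi-label framing described above can be sketched in a few lines. This is an illustrative toy, not GrASP's actual pipeline: the function names, the top-k delta vocabulary, and the trace are our own assumptions. Consecutive logical block addresses are converted to deltas, the most frequent deltas form the label space, and each prediction target is a multi-hot vector indicating which deltas occur next (the quantity an LSTM classifier would be trained to predict).

```python
from collections import Counter

def lba_deltas(trace):
    """Convert a sequence of logical block addresses to successive deltas."""
    return [b - a for a, b in zip(trace, trace[1:])]

def build_delta_vocab(deltas, top_k=4):
    """Keep the top-k most frequent deltas as the multi-label space."""
    return [d for d, _ in Counter(deltas).most_common(top_k)]

def multi_hot(next_deltas, vocab):
    """Multi-label target: which vocabulary deltas appear in the next window."""
    seen = set(next_deltas)
    return [1 if d in seen else 0 for d in vocab]

# Toy trace: mostly sequential accesses (+1) with an occasional +8 stride.
trace = [0, 1, 2, 10, 11, 12, 20, 21, 22, 30]
deltas = lba_deltas(trace)         # [1, 1, 8, 1, 1, 8, 1, 1, 8]
vocab = build_delta_vocab(deltas)  # [1, 8]
label = multi_hot(deltas[-3:], vocab)  # both deltas occur: [1, 1]
```

Because the label space is built over deltas rather than absolute addresses, the same vocabulary remains meaningful when the dataset grows, which is the property the abstract attributes to delta modeling.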