Hosting database services on cloud systems has become common practice. This has led to a growing volume of database workloads, which creates an opportunity for pattern analysis. Discovering workload patterns from a business logic perspective helps in understanding the trends and characteristics of a database system. However, existing workload pattern discovery systems are not suitable for the large-scale cloud databases commonly employed in industry, because the workload patterns of such databases are generally far more complicated than those of ordinary databases. In this paper, we propose Alibaba Workload Miner (AWM), a real-time system for discovering workload patterns in complicated large-scale workloads. AWM encodes the SQL queries logged from user requests, discovers their patterns, and optimizes query processing based on the discovered patterns. First, the Data Collection & Preprocessing Module collects streaming query logs and encodes them into high-dimensional feature embeddings that carry rich semantic context and execution features. Next, the Online Workload Mining Module separates the encoded queries by business group and discovers the workload patterns for each group. Meanwhile, the Offline Training Module collects labels and uses them to train the classification model. Finally, the Pattern-based Optimizing Module optimizes query processing in cloud databases by exploiting the discovered patterns. Extensive experimental results on one synthetic dataset and two real-life datasets (extracted from Alibaba Cloud databases) show that AWM enhances the accuracy of pattern discovery by 66% and reduces the latency of online inference by 22%, compared with state-of-the-art methods.
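To make the idea of per-group pattern discovery concrete, the following is a minimal illustrative sketch and not AWM's actual method. It replaces the learned embeddings and classification model described above with a much simpler stand-in: each logged query is reduced to a literal-free template, and templates are counted per business group. The function names `to_template` and `mine_patterns`, the `(business_group, sql)` log format, and the sample queries are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def to_template(sql: str) -> str:
    """Replace literals with '?' so that queries differing only in
    constants map to the same template (a stand-in for AWM's encoder)."""
    s = sql.strip().rstrip(";")
    s = re.sub(r"'(?:[^']|'')*'", "?", s)      # quoted string literals
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)   # standalone numeric literals
    s = re.sub(r"\s+", " ", s)                 # collapse whitespace
    return s.lower()


def mine_patterns(log, top_k=2):
    """log: iterable of (business_group, sql) pairs (assumed format).
    Returns {group: [(template, count), ...]} with the top_k templates."""
    counts = defaultdict(Counter)
    for group, sql in log:
        counts[group][to_template(sql)] += 1
    return {g: c.most_common(top_k) for g, c in counts.items()}


# Hypothetical query log, grouped by an assumed business-group label.
log = [
    ("orders", "SELECT * FROM orders WHERE id = 17;"),
    ("orders", "SELECT * FROM orders WHERE id = 42;"),
    ("orders", "UPDATE orders SET status = 'paid' WHERE id = 42"),
    ("users", "SELECT name FROM users WHERE email = 'a@example.com'"),
]
patterns = mine_patterns(log)
```

Here the two `SELECT` queries on `orders` collapse into one template, which becomes the most frequent pattern for that group. A system operating at the scale described above would need richer features than literal stripping, which is the role the abstract assigns to the semantic and execution-feature embeddings.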