We study the problem of minimizing data storage and access costs in the cloud while ensuring that the desired performance or latency is unaffected. First, we propose an optimizer that selects the cloud storage tier and the compression scheme for each data partition, given temporal access predictions for those partitions. Second, we propose a model that learns the compression performance of multiple algorithms across data partitions in different formats and generates compression performance predictions on the fly as inputs to the optimizer. Third, we approach data partitioning in a fundamentally different way from the current default in most data lakes, where partitions correspond to ingestion batches: we propose access-pattern-aware data partitioning and formulate an optimization problem that optimizes the size and reading costs of partitions subject to access patterns. We study these optimization problems both theoretically and empirically, and provide theoretical bounds as well as hardness results. We propose a unified cost-minimization pipeline, called SCOPe, that combines the different modules. We extensively compare our methods with related baselines from the literature on TPC-H data as well as enterprise datasets (ranging from GB to PB in volume) and show that SCOPe substantially improves over the baselines. Compared to platform baselines, we show cost savings on the order of 50% to 83% on enterprise Data Lake datasets that range from terabytes to petabytes in volume.
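To make the tier-and-compression decision concrete, the following is a minimal sketch of one way such a choice could be framed for a single partition: enumerate every (tier, compression scheme) pair, discard pairs that violate a latency bound, and keep the cheapest. It is not SCOPe's actual formulation. The cost model, the `Tier` and `Codec` types, the function `best_placement`, and all prices, ratios, and latencies are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Tier:
    """Hypothetical storage tier; prices and latency are made-up numbers."""
    name: str
    storage_cost_per_gb: float  # cost to keep 1 GB for one billing period
    read_cost_per_gb: float     # cost to read 1 GB once
    latency_ms: float           # fixed access latency of the tier


@dataclass(frozen=True)
class Codec:
    """Hypothetical compression scheme."""
    name: str
    ratio: float                 # compressed size / original size
    decompress_ms_per_gb: float  # decompression time per original GB


# Illustrative inputs only; these are not measurements from the paper.
TIERS = [
    Tier("hot", storage_cost_per_gb=0.02, read_cost_per_gb=0.0, latency_ms=10.0),
    Tier("cold", storage_cost_per_gb=0.004, read_cost_per_gb=0.02, latency_ms=100.0),
]
CODECS = [
    Codec("none", ratio=1.0, decompress_ms_per_gb=0.0),
    Codec("gzip", ratio=0.4, decompress_ms_per_gb=50.0),
]


def best_placement(
    size_gb: float,
    predicted_reads: int,
    max_latency_ms: float,
    tiers: Sequence[Tier] = TIERS,
    codecs: Sequence[Codec] = CODECS,
) -> Optional[Tuple[float, str, str]]:
    """Return (cost, tier name, codec name) of the cheapest feasible pair,
    or None if no pair meets the latency bound."""
    best = None
    for tier, codec in product(tiers, codecs):
        stored_gb = size_gb * codec.ratio
        # Assumed latency model: tier latency plus time to decompress the partition.
        latency = tier.latency_ms + codec.decompress_ms_per_gb * size_gb
        if latency > max_latency_ms:
            continue
        # Assumed cost model: storage cost plus per-read cost on compressed bytes.
        cost = (stored_gb * tier.storage_cost_per_gb
                + predicted_reads * stored_gb * tier.read_cost_per_gb)
        if best is None or cost < best[0]:
            best = (cost, tier.name, codec.name)
    return best


if __name__ == "__main__":
    # Rarely read partition with a loose latency bound.
    print(best_placement(size_gb=10, predicted_reads=0, max_latency_ms=1000))
    # Frequently read partition with a tight latency bound.
    print(best_placement(size_gb=10, predicted_reads=100, max_latency_ms=50))
```

Under these assumed numbers, a partition with no predicted reads and a loose latency bound lands on the cheaper tier with compression, while a frequently read partition with a tight bound stays uncompressed on the faster tier. Exhaustive enumeration is used here only because the toy search space is tiny and partitions are treated independently.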