We study the problem of optimizing data storage and access costs on the cloud while ensuring that the desired performance or latency is unaffected. First, we propose an optimizer that selects the cloud storage tier and the compression scheme for given data partitions with temporal access predictions. Second, we propose a model that learns the compression performance of multiple algorithms across data partitions in different formats and generates compression performance predictions on the fly as inputs to the optimizer. Third, we approach the data partitioning problem in a fundamentally different way from the current default in most data lakes, where partitions correspond to ingestion batches: we propose access-pattern-aware data partitioning and formulate an optimization problem that optimizes the size and reading costs of partitions subject to access patterns. We study these optimization problems both theoretically and empirically, and provide theoretical bounds as well as hardness results. We propose a unified cost-minimization pipeline, called SCOPe, that combines the different modules. We extensively compare our methods with related baselines from the literature on TPC-H data as well as enterprise datasets (ranging from GB to PB in volume) and show that SCOPe substantially improves over the baselines. Compared to platform baselines, we show significant cost savings, on the order of 50% to 83%, on enterprise Data Lake datasets that range from terabytes to petabytes in volume.
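To make the first component concrete, the following is a minimal sketch of a joint tier and compression choice for a single partition under a latency bound. It is not the SCOPe formulation: it simply enumerates every (tier, codec) pair and keeps the cheapest feasible one. The `Tier` and `Codec` classes, the `cheapest_plan` function, the cost model, and all prices, ratios, and latencies are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Tier:
    """Hypothetical storage tier; all numbers are illustrative assumptions."""
    name: str
    storage_cost_per_gb: float  # cost of keeping 1 GB for the planning period
    read_cost_per_gb: float     # cost of reading 1 GB once
    latency_ms: float           # fixed access latency of the tier


@dataclass(frozen=True)
class Codec:
    """Hypothetical compression scheme; ratio and speed are assumed inputs."""
    name: str
    ratio: float                 # stored size = raw size / ratio
    decompress_ms_per_gb: float  # decompression time per raw GB


def cheapest_plan(
    size_gb: float,
    predicted_reads: int,
    max_latency_ms: float,
    tiers: Sequence[Tier],
    codecs: Sequence[Codec],
) -> Optional[Tuple[float, str, str]]:
    """Return (cost, tier name, codec name) of the cheapest feasible pair,
    or None when no pair satisfies the latency bound."""
    best: Optional[Tuple[float, str, str]] = None
    for tier, codec in product(tiers, codecs):
        # Assumed latency model: tier latency plus time to decompress.
        latency = tier.latency_ms + codec.decompress_ms_per_gb * size_gb
        if latency > max_latency_ms:
            continue
        stored_gb = size_gb / codec.ratio
        # Assumed cost model: storage plus one read charge per predicted access.
        cost = stored_gb * (
            tier.storage_cost_per_gb + predicted_reads * tier.read_cost_per_gb
        )
        if best is None or cost < best[0]:
            best = (cost, tier.name, codec.name)
    return best


# Illustrative inputs only; these are not real cloud prices.
TIERS = [
    Tier("hot", storage_cost_per_gb=0.02, read_cost_per_gb=0.0, latency_ms=10.0),
    Tier("cold", storage_cost_per_gb=0.004, read_cost_per_gb=0.02, latency_ms=50.0),
]
CODECS = [
    Codec("none", ratio=1.0, decompress_ms_per_gb=0.0),
    Codec("heavy", ratio=4.0, decompress_ms_per_gb=100.0),
]

if __name__ == "__main__":
    # A rarely read partition with a loose latency bound.
    print(cheapest_plan(10.0, 0, 2000.0, TIERS, CODECS))
    # A frequently read partition with a tight latency bound.
    print(cheapest_plan(10.0, 100, 100.0, TIERS, CODECS))
```

Under these assumed numbers, a partition with no predicted reads and a loose latency bound ends up on the cold tier with the heavy codec, while a frequently read partition with a tight bound stays uncompressed on the hot tier. Exhaustive enumeration is only workable for one partition and a handful of options; it stands in here for whatever optimization method the full pipeline uses.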