Modern cloud data warehouses store data in micro-partitions and rely on metadata (e.g., zonemaps) to prune data efficiently during query processing. Keeping a large-scale table well clustered is crucial for effective pruning. Existing automatic clustering approaches lack the flexibility required in dynamic cloud environments with continuous data ingestion and evolving workloads. This paper advocates a clean separation between reclustering policy and clustering-key selection. We introduce the concept of boundary micro-partitions, which sit on the boundary of query ranges. We then present WAIR, a workload-aware algorithm that identifies and reclusters only the boundary micro-partitions most critical for pruning efficiency. WAIR achieves near-optimal query performance (relative to fully sorted table layouts) while incurring significantly lower reclustering cost, which has a theoretical upper bound. We further implement the algorithm in a prototype reclustering service and evaluate it on standard benchmarks (TPC-H, DSB) and a real-world workload. Results show that WAIR improves query performance and reduces overall cost compared to existing solutions.
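To make zonemap pruning and the notion of a boundary micro-partition concrete, the following is a minimal Python sketch. It is not the WAIR algorithm: it assumes a single integer clustering key, inclusive range predicates, and one reading of "boundary" (a partition whose min/max range overlaps a query range without being fully contained in it). The names `MicroPartition`, `classify`, and `boundary_counts` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical micro-partition described only by its zonemap."""
    name: str
    min_key: int  # zonemap minimum of the clustering key
    max_key: int  # zonemap maximum of the clustering key


def classify(part, lo, hi):
    """Classify a partition against the inclusive query range [lo, hi]."""
    if part.max_key < lo or part.min_key > hi:
        return "pruned"    # zonemap proves no row can match
    if lo <= part.min_key and part.max_key <= hi:
        return "inside"    # every row falls within the query range
    return "boundary"      # straddles an endpoint of the query range


def boundary_counts(parts, queries):
    """Count, per partition, how many queries it is a boundary partition for."""
    counts = {p.name: 0 for p in parts}
    for lo, hi in queries:
        for p in parts:
            if classify(p, lo, hi) == "boundary":
                counts[p.name] += 1
    return counts


# Illustrative data only: p2 overlaps its neighbours, so it is poorly clustered.
parts = [
    MicroPartition("p0", 0, 9),
    MicroPartition("p1", 10, 19),
    MicroPartition("p2", 5, 25),
    MicroPartition("p3", 30, 39),
]
queries = [(10, 19), (12, 35)]
counts = boundary_counts(parts, queries)
print(counts)
```

Under these assumptions, a partition with a wide key range such as `p2` is a boundary partition for more queries than its neighbours, which is the kind of signal a workload-aware policy could use to decide what to recluster first.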