Index-Assisted Stratified Sampling for Online Aggregation

Ad-hoc queries over frequently updated data in a flat schema are common in real-time data analysis applications and often require very low latency. Online aggregation can meet this requirement by providing approximate aggregate answers with confidence bound guarantees. It relies on the ability to draw samples online in time linear in the sample size rather than the database size, which index-assisted Sampling-based Approximate Query Processing (S-AQP) systems can support. However, approximate queries in these systems can still suffer from high latency because of the excessive sampling cost required to achieve a desired confidence bound: data with high variance in value distribution and selectivity requires a larger sample. Classic stratified sampling with Neyman allocation can minimize sample size in theory, but several challenges prevent it from being applicable in index-assisted S-AQP systems, including the need for a priori statistics, a high optimization cost, and an inaccurate sampling cost model based on sample size. To address these challenges, we design index-assisted stratified sampling for online aggregation, which features a two-phase sampling framework. Samples drawn in the first phase are used both for online aggregation and for optimizing future sampling cost, while the second phase continues the online aggregation using the optimized strata. We prove optimal stratification and sample size allocation strategies for an index-based sampling cost model, and design several greedy and dynamic-programming-based optimization methods to balance optimization cost against effectiveness in cost reduction. We evaluate our methods on several real-world and synthetic datasets and queries. The results show that our methods consistently achieve good speedups and, in extreme cases, up to 3x and 98708x speedups over index-assisted uniform sampling and classic scan-based stratified sampling, respectively.
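For readers unfamiliar with the classic baseline mentioned above, the following is a minimal sketch of textbook Neyman allocation, in which the sample size for each stratum is proportional to the stratum size times its standard deviation. It illustrates only the classic method, not the index-based cost model or the two-phase framework proposed in this work. The function names and the example numbers are hypothetical, and the per-stratum standard deviations are assumed to be known in advance, which is exactly the a priori statistics requirement noted above.

```python
def neyman_allocation(stratum_sizes, stratum_stddevs, total_samples):
    """Classic Neyman allocation: n_h is proportional to N_h * sigma_h.

    Returns fractional sample sizes; rounding is left to the caller.
    """
    weights = [n * s for n, s in zip(stratum_sizes, stratum_stddevs)]
    total = sum(weights)
    if total == 0:
        # Every stratum has zero variance: fall back to proportional allocation.
        weights = list(stratum_sizes)
        total = sum(weights)
    return [total_samples * w / total for w in weights]


def stratified_mean_variance(stratum_sizes, stratum_stddevs, allocation):
    """Variance of the stratified mean estimator.

    Ignores the finite population correction and skips strata that
    received no samples.
    """
    population = sum(stratum_sizes)
    return sum(
        (n_h / population) ** 2 * s_h ** 2 / a_h
        for n_h, s_h, a_h in zip(stratum_sizes, stratum_stddevs, allocation)
        if a_h > 0
    )


# Hypothetical example: two equally sized strata, one far noisier than the other.
sizes = [1000, 1000]
stddevs = [1.0, 9.0]
neyman = neyman_allocation(sizes, stddevs, total_samples=100)
proportional = [50.0, 50.0]

var_neyman = stratified_mean_variance(sizes, stddevs, neyman)
var_proportional = stratified_mean_variance(sizes, stddevs, proportional)
```

In this example the noisier stratum receives most of the sample budget, and the resulting estimator variance is lower than under proportional allocation with the same total sample size. The sketch measures cost purely by sample count, which is the assumption the abstract identifies as inaccurate for index-assisted sampling.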
