Data scientists often draw on multiple relational data sources for analysis. A standard assumption in learning and approximate query answering is that the data is a uniform and independent sample of the underlying distribution. Given a set of joins, we study the problem of obtaining a random sample from their union without computing the full joins or the union, thereby avoiding the cost of both operations. We present a general framework for random sampling over the set union of chain, acyclic, and cyclic joins, with guarantees of sample uniformity and independence. We study the novel problem of estimating the size of a union of joins and propose two approximation methods, one based on column histograms and one based on random walks over the data. We propose an online union sampling framework that is initialized with cheap-to-compute parameter approximations and refines them on the fly during sampling. We evaluate our framework on workloads from the TPC-H benchmark and explore the trade-off between the accuracy of union approximation and sampling efficiency.
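To make the goal of uniform sampling over a set union of joins concrete, the following is a minimal sketch of a standard rejection-based approach. It is not necessarily the method of the framework itself. It assumes toy two-table equi-joins over a shared schema, tables with distinct rows, and exactly known join sizes (the abstract instead describes approximating such parameters). The names `TwoTableJoin` and `sample_union` are hypothetical. A join is chosen with probability proportional to its size, a tuple is drawn uniformly from it without materializing the join, and the tuple is accepted with probability 1/k, where k is the number of joins that contain it. This makes every tuple of the union equally likely.

```python
import random
from collections import defaultdict


class TwoTableJoin:
    """Toy equi-join R(a, b) JOIN S(b, c); the join result is never materialized.

    Assumption: rows within each table are distinct.
    """

    def __init__(self, r_rows, s_rows):
        self.r_rows = list(r_rows)
        self.r_set = set(self.r_rows)
        self.s_set = set(s_rows)
        self.s_index = defaultdict(list)  # join key b -> list of c values
        for b, c in s_rows:
            self.s_index[b].append(c)
        # Degree of each R row = number of S rows it joins with.
        self.weights = [len(self.s_index.get(b, ())) for _, b in self.r_rows]
        self.size = sum(self.weights)  # exact join size, computed from degrees

    def sample(self, rng):
        # Pick an R row with probability proportional to its degree,
        # then one of its matches uniformly: uniform over the join result.
        a, b = rng.choices(self.r_rows, weights=self.weights, k=1)[0]
        return (a, b, rng.choice(self.s_index[b]))

    def contains(self, t):
        # Membership test without materializing the join.
        a, b, c = t
        return (a, b) in self.r_set and (b, c) in self.s_set


def sample_union(joins, n, seed=0):
    """Draw n independent, uniform samples (with replacement) from the set union."""
    rng = random.Random(seed)
    sizes = [j.size for j in joins]
    out = []
    while len(out) < n:
        j = rng.choices(joins, weights=sizes, k=1)[0]  # join chosen by size
        t = j.sample(rng)
        k = sum(other.contains(t) for other in joins)  # k >= 1, since t is in j
        if rng.random() < 1.0 / k:  # reject to cancel the overlap bias
            out.append(t)
    return out


if __name__ == "__main__":
    j1 = TwoTableJoin([(1, "x"), (2, "y")], [("x", 10), ("x", 11), ("y", 20)])
    j2 = TwoTableJoin([(1, "x"), (3, "z")], [("x", 10), ("z", 30)])
    print(sample_union([j1, j2], 5, seed=1))
```

Without the 1/k rejection step, a tuple that appears in several joins would be drawn more often than one that appears in a single join. The sketch relies on exact sizes and overlap counts, which is the cost that size approximation is meant to reduce.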