Analytical join queries over unstructured data are increasingly common. Applying machine learning (ML) models to label every pair in the cross product of tables can achieve state-of-the-art accuracy, but the cost of executing ML models on every pair is prohibitive. Existing algorithms, such as embedding-based blocking and sampling, aim to reduce this cost. However, they either fail to provide statistical guarantees (leading to errors up to 79% higher than expected) or become as inefficient as uniform sampling. We propose blocking-augmented sampling (BaS), which simultaneously achieves statistical guarantees and high efficiency. BaS optimally orchestrates embedding-based blocking and sampling so that each mitigates the other's limitations. Specifically, BaS partitions the tuples in the cross product into two regimes based on the failure modes of embeddings. In the regime of false negatives, BaS uses sampling to estimate the result. In the regime of false positives, BaS applies embedding-based blocking to improve efficiency. To minimize the estimation error given a budget for ML executions, we design a novel two-stage algorithm that adaptively allocates the budget between blocking and sampling. Theoretically, we prove that BaS asymptotically outperforms or matches standalone sampling. On real-world datasets across different modalities, we show that BaS provides valid confidence intervals and reduces estimation errors by up to 19$\times$ compared to state-of-the-art baselines.
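The split between a blocking regime and a sampling regime can be illustrated with a minimal sketch. This is not the BaS algorithm itself: it omits the adaptive two-stage budget allocation and the confidence intervals, and it uses a fixed budget split chosen by the caller. The names `estimate_join_count`, `similarity`, `oracle`, `block_size`, and `sample_size` are hypothetical, introduced only for illustration. `oracle` stands in for the expensive ML model and `similarity` for a cheap embedding score. The sketch estimates a COUNT over the join: the most similar pairs are labeled exhaustively, and the remaining pairs are estimated from a uniform sample.

```python
import random


def estimate_join_count(left, right, similarity, oracle,
                        block_size, sample_size, seed=0):
    """Estimate how many pairs in left x right satisfy `oracle`.

    Assumption (illustrative only): the `block_size` most similar pairs are
    labeled exhaustively, and the rest is estimated by uniform sampling.
    Total oracle calls: at most block_size + sample_size.
    """
    rng = random.Random(seed)

    # Materializing the cross product is only feasible for small inputs;
    # it keeps the sketch short.
    pairs = [(a, b) for a in left for b in right]

    # Rank pairs by the cheap similarity score, most similar first.
    pairs.sort(key=lambda p: similarity(p[0], p[1]), reverse=True)
    block, rest = pairs[:block_size], pairs[block_size:]

    # Blocking regime: label every high-similarity pair, so embedding
    # false positives are filtered out exactly.
    block_count = sum(1 for a, b in block if oracle(a, b))

    # Sampling regime: low-similarity pairs may hide matches the embedding
    # missed (false negatives), so estimate them from a uniform sample.
    n = min(sample_size, len(rest))
    if n == 0:
        return float(block_count)
    sample = rng.sample(rest, n)
    hit_rate = sum(1 for a, b in sample if oracle(a, b)) / n

    # Scale the sampled hit rate up to the size of the unlabeled remainder.
    return block_count + hit_rate * len(rest)
```

The estimate is unbiased for any fixed `block_size`, because the blocked pairs are counted exactly and the remainder is sampled uniformly. How to divide the budget between the two regimes is the question the two-stage algorithm in the abstract addresses, and this sketch leaves it to the caller.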