Most research on data discovery has so far focused on improving individual discovery operators such as join, correlation, or union discovery. However, in practice, a combination of these techniques and their corresponding indexes may be necessary to support arbitrary discovery tasks. We propose BLEND, a comprehensive data discovery system that supports existing operators and enables their flexible pipelining. BLEND is based on a set of lower-level operators that serve as fundamental building blocks for more complex and sophisticated user tasks. To reduce the execution runtime of discovery pipelines, we propose a unified index structure and a rule-based optimizer that rewrites SQL statements into low-level operators when possible. We show the superior flexibility and efficiency of our system compared to ad-hoc discovery pipelines and stand-alone solutions.
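The idea of composing discovery tasks from lower-level operators over one shared index can be illustrated with a minimal sketch. This is not BLEND's actual API or index layout: the toy data lake and the names `build_index`, `column_overlap`, `keyword_tables`, and `joinable_and_relevant` are hypothetical, and the index is assumed to be a plain in-memory inverted index from cell values to their positions.

```python
from collections import defaultdict

# Toy data lake: table name -> list of rows. All names here are hypothetical.
LAKE = {
    "cities": [{"city": "Berlin", "country": "Germany"},
               {"city": "Paris", "country": "France"}],
    "airports": [{"code": "BER", "city": "Berlin"},
                 {"code": "CDG", "city": "Paris"},
                 {"code": "LHR", "city": "London"}],
    "cheeses": [{"name": "Brie", "origin": "France"}],
}

def build_index(lake):
    """One inverted index shared by all operators: value -> {(table, column, row_id)}."""
    index = defaultdict(set)
    for table, rows in lake.items():
        for row_id, row in enumerate(rows):
            for column, value in row.items():
                index[value.lower()].add((table, column, row_id))
    return index

def column_overlap(index, values):
    """Low-level operator: count distinct query values found in each (table, column)."""
    scores = defaultdict(int)
    for value in {v.lower() for v in values}:
        # Collapse row-level postings so each column counts a value at most once.
        for table, column in {(t, c) for t, c, _ in index.get(value, ())}:
            scores[(table, column)] += 1
    return dict(scores)

def keyword_tables(index, keywords):
    """Low-level operator: tables containing at least one keyword in any cell."""
    return {t for k in keywords for t, _, _ in index.get(k.lower(), ())}

def joinable_and_relevant(index, query_column, keywords, min_overlap=2):
    """Pipeline: join-candidate tables intersected with keyword-matching tables."""
    join_tables = {t for (t, _), score in column_overlap(index, query_column).items()
                   if score >= min_overlap}
    return join_tables & keyword_tables(index, keywords)

INDEX = build_index(LAKE)
```

Both operators read the same index, so a pipeline such as `joinable_and_relevant(INDEX, ["Berlin", "Paris", "Rome"], ["LHR"])` needs no separate per-operator index. A real system would additionally need ranking, multi-column joins, and persistent storage, all of which this sketch omits.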