Organizations are collecting increasingly large amounts of data for data-driven decision making. These data are often dumped into a centralized repository, e.g., a data lake, consisting of thousands of structured and unstructured datasets. Paradoxically, such a mixture of datasets makes it very challenging to discover the elements (e.g., tables or documents) that are relevant to a user's query or an analytical task. Despite recent efforts in data discovery, the problem remains wide open, especially on two fronts: (1) discovering relationships and relatedness across structured and unstructured datasets, where existing techniques either scale poorly, are customized for a specific problem type (e.g., entity matching or data integration), or destroy the structural properties of the data along the way, and (2) developing a holistic system that effectively integrates various similarity measures and sketches to boost discovery accuracy. In this paper, we propose a new data discovery system, named CMDL, that addresses these two limitations. CMDL supports the data discovery process over both structured and unstructured data while retaining the structural properties of tables.