Data profiling is an essential process in modern data-driven industries. One of its critical components is the discovery and validation of complex statistics, including functional dependencies, data constraints, association rules, and others. However, most existing data profiling systems that focus on complex statistics do not integrate properly with the tools used by contemporary data scientists, which creates a significant barrier to their adoption in industry. Moreover, existing systems were not created with industrial-grade workloads in mind. Finally, they do not aim to provide descriptive explanations, i.e., why a given pattern is not found. This is a significant issue, because understanding the underlying reasons for a specific pattern's absence is essential to making informed decisions based on the data. As a result, these patterns are effectively left hanging in thin air: their application scope is rather limited, and they are rarely used by the broader public. At the same time, as we demonstrate in this presentation, complex statistics can be used efficiently to solve many classic data quality problems. Desbordante is an open-source data profiler that aims to close this gap. It is built with an emphasis on industrial application: it is efficient, scalable, resilient to crashes, and provides explanations. Furthermore, it provides seamless Python integration by offloading various costly operations, not only mining, to the C++ core. In this demonstration, we show several scenarios that allow end users to solve different data quality problems. Namely, we showcase typo detection, data deduplication, and data anomaly detection scenarios.