Inspired by the visualization of dental plaque at the dentist's office, this article proposes a novel visualization technique for identifying redundancies in relational data. Our approach builds upon an established information-theoretic framework that, despite being well-principled, remains unexplored in practical applications. In this framework, we calculate the information content (or entropy) of each cell in a relation instance, given a set of functional dependencies. The entropy value reflects how readily the cell's value can be inferred from the dependencies and the remaining tuples: the lower the entropy, the more redundant the cell. By highlighting cells with lower entropy, we visualize redundancies in the data. We present an initial prototype implementation and demonstrate that a straightforward approach is insufficient for handling practical problem sizes. To address this limitation, we propose several optimizations, which we prove to be correct. Additionally, we present a Monte Carlo approximation technique with a known error, which makes the computation tractable. Using a real-world dataset of modest size, we illustrate the potential of our visualization technique. Our vision is to support domain experts with data profiling and data cleaning tasks, akin to the functionality of a plaque test at the dentist's.
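To make the notion of a redundant cell concrete, the following is a minimal sketch of a much cruder, binary proxy for the idea described above. It is not the entropy measure of the information-theoretic framework, nor the Monte Carlo approximation. It only flags a cell as redundant when a functional dependency together with another tuple already determines its value. The function name `redundant_cells`, the representation of a dependency as a `(lhs, rhs)` pair, and the sample relation are all hypothetical, and the sketch assumes that the instance satisfies the given dependencies.

```python
from collections import defaultdict


def redundant_cells(rows, fds):
    """Return the set of (row_index, attribute) cells implied by an FD and another tuple.

    rows: list of dicts, one per tuple (hypothetical representation).
    fds:  list of (lhs, rhs) pairs, where lhs is a tuple of attribute names
          and rhs is a single attribute name, read as lhs -> rhs.
    Assumes the instance satisfies every dependency in fds.
    """
    flagged = set()
    for lhs, rhs in fds:
        # Group tuple indices by their values on the left-hand side.
        groups = defaultdict(list)
        for i, row in enumerate(rows):
            groups[tuple(row[a] for a in lhs)].append(i)
        # If several tuples agree on lhs, each of their rhs cells
        # can be inferred from any other tuple in the same group.
        for members in groups.values():
            if len(members) > 1:
                for i in members:
                    flagged.add((i, rhs))
    return flagged


# Made-up example relation with the dependency zip -> city.
rows = [
    {"name": "Ada", "zip": "94032", "city": "Passau"},
    {"name": "Bob", "zip": "94032", "city": "Passau"},
    {"name": "Cy", "zip": "80331", "city": "Munich"},
]
fds = [(("zip",), "city")]

print(sorted(redundant_cells(rows, fds)))  # [(0, 'city'), (1, 'city')]
```

In this toy relation, the two `city` cells that share a `zip` value are flagged, while the third is not. The entropy-based measure is finer-grained than this yes/no flag, since it assigns each cell a graded information content that can be mapped to a highlight intensity.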