Deep Learning (DL) techniques now constitute the state of the art for important problems in areas such as text and image processing, and there have been impactful results from deploying DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly image processing, its impact on mainstream data management tasks remains unexplored. In this paper, we address this gap by investigating the impact of DC on data cleaning and integration tasks, specifically schema inference, entity resolution, and domain discovery, which represent clustering from the perspective of tables, rows, and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. However, we also observed a significant correlation between the choice of DC method and the embedding approach used for rows, columns, and tables, which indicates that a suitable combination can enhance the efficiency of DC methods.