Deep Learning (DL) techniques now constitute the state of the art for important problems in areas such as text and image processing, and there have been impactful results that deploy DL in several data management tasks. Deep Clustering (DC) has recently emerged as a sub-discipline of DL, in which data representations are learned in tandem with clustering, with a view to automatically identifying the features of the data that lead to improved clustering results. While DC has been used to good effect in several domains, particularly in image processing, its impact on mainstream data management tasks remains unexplored. In this paper, we address this gap by investigating the impact of DC on canonical data cleaning and integration tasks, including schema inference, entity resolution and domain discovery, which represent clustering from the perspective of tables, rows and columns, respectively. In this setting, we compare and contrast several DC and non-DC clustering algorithms using standard benchmarks. The results show, among other things, that the most effective DC algorithms consistently outperform non-DC clustering algorithms for data integration tasks. However, we also observe that the choice of embedding approach for rows, columns and tables significantly affects clustering performance.