Clustering is a core tool in computer science, supporting data analysis and problem-solving across many fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships in the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. In relational databases, where data is distributed across multiple tables, efficient clustering is essential yet challenging because joining the tables is computationally expensive. This paper addresses this challenge by introducing efficient algorithms for $k$-median and $k$-means clustering on relational data that do not require pre-computing the join query results. For relational $k$-median clustering, we propose the first efficient relative approximation algorithm. For relational $k$-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational $k$-means clustering algorithms, which suffer from either large constant approximation factors or expensive running times. Given a join query $Q$ and a database instance $D$ of $O(N)$ tuples, for both $k$-median and $k$-means clustering on the results of $Q$ on $D$, we propose randomized $(1+\varepsilon)\gamma$-approximation algorithms that run in roughly $O(k^2N^{\mathsf{fhw}})+T_\gamma(k^2)$ time, where $\varepsilon\in (0,1)$ is a constant parameter chosen by the user, $\mathsf{fhw}$ is the fractional hypertree width of $Q$, and $\gamma$ and $T_\gamma(x)$ are, respectively, the approximation factor and the running time of a traditional clustering algorithm in the standard computational setting over $x$ points.
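To make the objective concrete, the following minimal sketch shows the naive baseline that the abstract contrasts against: materialize the join results first, then evaluate the $k$-median or $k$-means cost of a candidate set of centers over them. It is not the paper's algorithm. The helper names `natural_join` and `clustering_cost`, the toy relations `R` and `S`, and the choice of the attributes `a` and `c` as point coordinates are all assumptions made for illustration.

```python
import math


def natural_join(left, right):
    """Nested-loop natural join of two relations given as lists of dicts.

    Hypothetical helper: tuples are matched on every attribute name they share.
    """
    out = []
    for a in left:
        for b in right:
            shared = a.keys() & b.keys()
            if all(a[x] == b[x] for x in shared):
                out.append({**a, **b})
    return out


def clustering_cost(points, centers, power):
    """Sum over points of (Euclidean distance to the nearest center) ** power.

    power=1 gives the k-median objective, power=2 the k-means objective.
    """
    total = 0.0
    for p in points:
        nearest = min(math.dist(p, c) for c in centers)
        total += nearest ** power
    return total


# Toy database instance (illustrative data only).
R = [{"a": 0.0, "b": 1}, {"a": 1.0, "b": 1}, {"a": 10.0, "b": 2}]
S = [{"b": 1, "c": 0.0}, {"b": 1, "c": 1.0}, {"b": 2, "c": 10.0}]

# Baseline: materialize the join, then treat (a, c) as a point in the plane.
joined = natural_join(R, S)
points = [(t["a"], t["c"]) for t in joined]

centers = [(0.5, 0.5), (10.0, 10.0)]  # a candidate solution with k = 2
kmedian_cost = clustering_cost(points, centers, power=1)
kmeans_cost = clustering_cost(points, centers, power=2)
```

Even in this toy instance the join has more tuples than either input relation. In general the join output can be much larger than the $O(N)$ input, which is why the algorithms described above avoid materializing it.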