Clustering is a core tool in computer science that supports data analysis and problem solving across many fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. In relational databases, where data is distributed across multiple tables, efficient clustering is essential yet challenging because joining the tables is computationally expensive. This paper addresses this challenge by introducing efficient algorithms for $k$-median and $k$-means clustering on relational data that do not need to pre-compute the join query results. For relational $k$-median clustering, we propose the first efficient relative approximation algorithm. For relational $k$-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational $k$-means clustering algorithms, which suffer from either large constant approximation factors or high running times. Given a join query $Q$ and a database instance $D$ of $O(N)$ tuples, for both $k$-median and $k$-means clustering on the results of $Q$ on $D$, we propose randomized $(1+\varepsilon)\gamma$-approximation algorithms that run in roughly $O(k^2N^{\mathsf{fhw}})+T_\gamma(k^2)$ time, where $\varepsilon\in (0,1)$ is a constant parameter chosen by the user, $\mathsf{fhw}$ is the fractional hypertree width of $Q$, and $\gamma$ and $T_\gamma(x)$ are, respectively, the approximation factor and the running time of a traditional clustering algorithm in the standard computational setting over $x$ points.
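To make the problem setting concrete, the following is a minimal sketch of the naive baseline that the abstract contrasts against: materialize the join first, then evaluate the clustering objective on the joined tuples. It is not the algorithm proposed in the paper. The two-table schema `R(a, b)`, `S(b, c)`, the helper names `join_rs` and `clustering_cost`, and the sample data are all assumptions made for illustration.

```python
import math

def join_rs(R, S):
    """Materialize the natural join of R(a, b) and S(b, c) on attribute b.

    Hypothetical two-table schema; a real join query Q may involve many tables.
    """
    # Hash index on S keyed by the join attribute b.
    index = {}
    for b, c in S:
        index.setdefault(b, []).append(c)
    return [(a, b, c) for a, b in R for c in index.get(b, [])]

def clustering_cost(points, centers, power):
    """Sum over points of (Euclidean distance to the nearest center) ** power.

    power=1 gives the k-median objective, power=2 the k-means objective.
    """
    return sum(min(math.dist(p, c) for c in centers) ** power for p in points)

# Illustrative data (assumed, not taken from the paper).
R = [(0, 0), (2, 0), (10, 1)]
S = [(0, 0), (1, 10)]
points = join_rs(R, S)                # tuples of Q(D), fully materialized
centers = [(0, 0, 0), (10, 1, 10)]    # k = 2 candidate centers

k_median_cost = clustering_cost(points, centers, power=1)
k_means_cost = clustering_cost(points, centers, power=2)
```

The join result can be much larger than the input tables, which is why evaluating the objective this way becomes impractical and why the abstract emphasizes avoiding the pre-computed join.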