A recent trend in data mining explores (hyper)graph clustering algorithms for data with categorical relationship types. Such algorithms have applications in the analysis of social, co-authorship, and protein interaction networks, among others. Many of these applications naturally exhibit overlap between clusters, a nuance that is missing from current combinatorial models. Existing models also lack a mechanism for handling noise in datasets. We address these concerns by generalizing Edge-Colored Clustering, a recent framework for categorical clustering of hypergraphs. Our generalizations allow for a budgeted number of either (a) overlapping cluster assignments or (b) node deletions. For each new model we present a greedy algorithm that approximately minimizes an edge mistake objective, as well as bicriteria approximations in which the second approximation factor applies to the budget. We also address the parameterized complexity of each problem, providing fixed-parameter tractable (FPT) algorithms and hardness results.
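To make the edge mistake objective concrete, here is a minimal sketch of the base Edge-Colored Clustering setting, under stated assumptions: each node receives exactly one color, and a hyperedge counts as a mistake unless every one of its nodes carries the edge's color. The majority-vote rule below is a simple illustrative heuristic and is not claimed to be the greedy algorithm of this work; the function names, the input format, and the tie-breaking rule are all hypothetical choices made for the example.

```python
from collections import Counter

def majority_vote_assignment(edges):
    """Assign each node the color that appears most often among its hyperedges.

    edges: list of (nodes, color) pairs, where nodes is a tuple of node labels.
    Ties are broken by the smallest color label (an assumption for determinism).
    """
    votes = {}
    for nodes, color in edges:
        for v in nodes:
            votes.setdefault(v, Counter())[color] += 1
    return {
        v: min(counts, key=lambda col: (-counts[col], col))
        for v, counts in votes.items()
    }

def count_edge_mistakes(edges, assignment):
    """Count hyperedges with at least one node not assigned the edge's color."""
    return sum(
        1
        for nodes, color in edges
        if any(assignment[v] != color for v in nodes)
    )

# Toy edge-colored hypergraph (here every hyperedge happens to have two nodes).
edges = [
    (("a", "b"), "red"),
    (("b", "c"), "red"),
    (("c", "d"), "blue"),
    (("a", "d"), "blue"),
    (("b", "d"), "red"),
]

assignment = majority_vote_assignment(edges)
mistakes = count_edge_mistakes(edges, assignment)
```

In this toy instance nodes `a` and `c` each sit in one red and one blue edge, so any single-color assignment forces a mistake at those nodes. The budgeted generalizations described above would instead let a limited number of such nodes take more than one color, or be deleted.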