A recent line of work in data mining has explored (hyper)graph clustering algorithms for data with categorical relationship types. Such algorithms have applications in the analysis of social, co-authorship, and protein interaction networks, to name a few. Many such applications naturally have some overlap between clusters, a nuance that is missing from current combinatorial models. Additionally, existing models lack a mechanism for handling noise in datasets. We address these concerns by generalizing Edge-Colored Clustering, a recent framework for categorical clustering of hypergraphs. Our generalizations allow for a budgeted number of either (a) overlapping cluster assignments or (b) node deletions. For each new model we present a greedy algorithm that approximately minimizes an edge mistake objective, as well as bicriteria approximations in which the second approximation factor applies to the budget. Additionally, we address the parameterized complexity of each problem, providing FPT algorithms and hardness results.
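To make the edge mistake objective concrete, the following is a minimal sketch, not the algorithms of this work. It assumes the standard Edge-Colored Clustering setup in which a hyperedge counts as a mistake unless every one of its nodes carries the edge's color, and it uses a simple majority-vote heuristic as a stand-in baseline. The function names, the toy hypergraph, and the representation of overlap as a set of colors per node are all illustrative assumptions.

```python
from collections import Counter

def majority_vote_labels(nodes, edges):
    """Assign each node the single color most common among its incident edges.

    `edges` is a list of (color, members) pairs. This is an illustrative
    baseline heuristic, assumed here for demonstration only.
    """
    counts = {v: Counter() for v in nodes}
    for color, members in edges:
        for v in members:
            counts[v][color] += 1
    labels = {}
    for v, c in counts.items():
        # Ties are broken by taking the smallest color name; isolated nodes get no color.
        labels[v] = {max(sorted(c), key=c.__getitem__)} if c else set()
    return labels

def edge_mistakes(edges, labels):
    """Count hyperedges in which some node lacks the edge's color."""
    return sum(
        1 for color, members in edges
        if any(color not in labels[v] for v in members)
    )

def extra_assignments(labels):
    """Number of color assignments beyond one per node (the overlap budget used)."""
    return sum(max(len(colors) - 1, 0) for colors in labels.values())

# Hypothetical toy hypergraph: node 2 sits between a red group and a blue group.
nodes = [0, 1, 2, 3, 4]
edges = [("red", (0, 1, 2)), ("blue", (2, 3)), ("blue", (2, 4))]

single = majority_vote_labels(nodes, edges)
mistakes_single = edge_mistakes(edges, single)

# Spend one unit of overlap budget: let node 2 belong to both clusters.
overlapping = {v: set(colors) for v, colors in single.items()}
overlapping[2].add("red")
mistakes_overlap = edge_mistakes(edges, overlapping)
budget_used = extra_assignments(overlapping)
```

In this toy instance, a single color per node forces one hyperedge to be a mistake, while one extra assignment for node 2 satisfies all three edges. This is the kind of trade-off between mistakes and budget that the overlapping model captures.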