$k$-Center clustering is a fundamental classification problem: the task is to partition a given collection of entities into $k$ clusters and to choose a representative for each cluster so that the maximum distance between an entity and its representative is minimized. In this work, we focus on the setting where the entities are binary vectors with missing entries, which models incomplete categorical data. This version of the problem has wide applications, from predictive analytics to bioinformatics. Our main finding is that the problem, which is notoriously hard from the classical complexity viewpoint, becomes tractable as soon as the known entries are sparse and exhibit a certain structure. Formally, we give fixed-parameter algorithms parameterized by the vertex cover number, fracture number, and treewidth of the row-column graph, which encodes the positions of the known entries of the matrix. Additionally, we tie the complexity of the 1-cluster variant of the problem, well known as Closest String, to the complexity of solving integer linear programs with few constraints. In particular, this implies that improving upon the running times of our algorithms would lead to more efficient algorithms for integer linear programming in general.
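To make the objective concrete, here is a minimal sketch of the problem on a toy instance. It is not the paper's algorithm: it is an exponential brute force meant only to illustrate the definitions. It rests on several assumptions that the abstract does not spell out: missing entries are encoded as `None`, the distance from a row to a representative counts only the known entries on which they disagree, and the row-column graph is taken to be the bipartite graph with one vertex per row, one per column, and an edge for each known entry. All function names are hypothetical.

```python
from itertools import product

def distance(row, center):
    """Hamming distance restricted to the known (non-None) entries of row."""
    return sum(1 for x, c in zip(row, center) if x is not None and x != c)

def radius(rows, centers):
    """k-center objective: the largest distance from a row to its nearest center."""
    return max(min(distance(r, c) for c in centers) for r in rows)

def brute_force_k_center(rows, k):
    """Try every k-tuple of binary centers; exponential, for tiny inputs only."""
    m = len(rows[0])
    candidates = list(product((0, 1), repeat=m))
    best = None
    for centers in product(candidates, repeat=k):
        r = radius(rows, centers)
        if best is None or r < best[0]:
            best = (r, centers)
    return best  # (optimal radius, one optimal tuple of centers)

def row_column_graph(rows):
    """Assumed encoding: one edge (("r", i), ("c", j)) per known entry (i, j)."""
    return {
        (("r", i), ("c", j))
        for i, row in enumerate(rows)
        for j, x in enumerate(row)
        if x is not None
    }

if __name__ == "__main__":
    # Four binary vectors of length 3; None marks a missing entry.
    rows = [
        [0, 0, None],
        [0, None, 1],
        [1, 1, None],
        [None, 1, 1],
    ]
    print(brute_force_k_center(rows, 2))
    print(sorted(row_column_graph(rows)))
```

With $k = 1$ the same brute force solves the toy analogue of Closest String with missing entries. The sparser the known entries, the fewer edges the row-column graph has, which is the kind of structure the parameterized algorithms exploit.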