Most subspace clustering (SC) algorithms depend on one or more hyperparameters that must be carefully tuned to achieve high clustering performance. Hyperparameter optimization (HPO) is often performed using grid search, assuming that some labeled data is available. In some domains, such as medicine, this assumption often does not hold. One avenue of research focuses on developing SC algorithms that are inherently free of hyperparameters. For hyperparameter-dependent SC algorithms, one approach to label-independent HPO is based on internal clustering quality metrics (if available), whose performance should ideally match that of external (label-dependent) clustering quality metrics. In this paper, we propose a novel approach to label-independent HPO that uses clustering quality metrics, such as accuracy (ACC) or normalized mutual information (NMI), computed from pseudo-labels obtained from the SC algorithm across a predefined grid of hyperparameters. Assuming that ACC (or NMI) is a smooth function of the hyperparameter values, it is possible to select subintervals of hyperparameters. These subintervals are then iteratively split into halves or thirds until a relative error criterion is satisfied. In principle, the hyperparameters of any SC algorithm can be tuned using the proposed method. We demonstrate this approach on several single- and multi-view SC algorithms, comparing the achieved performance with that of their oracle versions across six datasets representing digits, faces and objects. The proposed method typically achieves clustering performance that is 5% to 7% lower than that of the oracle versions. We also make the proposed method interpretable by visualizing subspace bases, which are estimated from the computed clustering partitions. This aids in the initial selection of the hyperparameter search space.
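As a rough illustration of the pseudo-label idea described in the abstract, the sketch below computes NMI between the partitions produced at two hyperparameter values and bisects an interval until a relative-change criterion is met. The abstract does not state the exact subinterval selection rule or the form of the relative error criterion, so both are assumptions here: the sketch keeps whichever half shows higher agreement between its endpoint partitions and stops when the agreement score changes by less than `tol` in relative terms. The names `nmi`, `refine_interval`, and `cluster_fn` are hypothetical, and `cluster_fn` stands in for any SC algorithm that maps a hyperparameter value to a list of cluster labels.

```python
import math
from collections import Counter


def nmi(a, b):
    """NMI between two labelings, normalized by the mean of the two entropies."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("labelings must be non-empty and of equal length")
    count_a, count_b = Counter(a), Counter(b)
    count_ab = Counter(zip(a, b))
    h_a = -sum(c / n * math.log(c / n) for c in count_a.values())
    h_b = -sum(c / n * math.log(c / n) for c in count_b.values())
    mi = sum(
        c / n * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )
    denom = 0.5 * (h_a + h_b)
    # Both labelings are a single cluster: treat them as identical partitions.
    return 1.0 if denom == 0 else mi / denom


def refine_interval(cluster_fn, lo, hi, tol=1e-2, max_iter=20):
    """Bisect [lo, hi], keeping the half whose endpoint pseudo-labels agree more.

    Assumption (not taken from the paper): agreement is measured by NMI between
    the partitions at the two endpoints, and iteration stops when the score
    changes by at most `tol` relative to the previous score.
    """
    labels_lo, labels_hi = cluster_fn(lo), cluster_fn(hi)
    prev = nmi(labels_lo, labels_hi)
    score = prev
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        labels_mid = cluster_fn(mid)
        left = nmi(labels_lo, labels_mid)
        right = nmi(labels_mid, labels_hi)
        if left >= right:
            hi, labels_hi, score = mid, labels_mid, left
        else:
            lo, labels_lo, score = mid, labels_mid, right
        # Relative-change stopping rule (an assumed form of the criterion).
        if abs(score - prev) <= tol * max(abs(prev), 1e-12):
            break
        prev = score
    return lo, hi, score
```

In practice `cluster_fn` would wrap a call to the SC algorithm at hand, and the initial `[lo, hi]` would come from the predefined hyperparameter grid; splitting into thirds instead of halves would require evaluating two interior points per iteration.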