Robust Clustering with Normal Mixture Models: A Pseudo $\beta$-Likelihood Approach

As in other estimation scenarios, likelihood-based estimation in the normal mixture set-up is highly non-robust against model misspecification and the presence of outliers (apart from being an ill-posed optimization problem). A robust alternative to the ordinary likelihood approach is proposed for this estimation problem; it performs estimation and data clustering simultaneously and leads to subsequent anomaly detection. To achieve robustness, the methodology is based on minimizing the density power divergence (or, alternatively, maximizing the $\beta$-likelihood) under suitable constraints. An iteratively reweighted least squares approach is used to compute the proposed estimators of the component means (equivalently, the cluster centers) and the component dispersion matrices, which leads to simultaneous data clustering. Some exploratory techniques are also suggested for anomaly detection, a problem of great importance in statistics and machine learning. The proposed method is validated with simulation studies under different set-ups; it performs competitively with, or better than, popular existing methods such as K-medoids, TCLUST, trimmed K-means and MCLUST, especially when the mixture components (i.e., the clusters) overlap significantly or when outlying clusters exist with small but non-negligible weights (particularly in higher dimensions). Two real datasets, along with an application in image processing, are also used to illustrate the performance of the newly proposed method in comparison with the others. The proposed method detects the clusters with lower misclassification rates and successfully identifies the outlying (anomalous) observations in these datasets.
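To give a feel for how density-power-divergence-style reweighting downweights outliers, here is a minimal sketch for a much simpler problem than the one in the paper: estimating the mean of a single univariate normal with the scale treated as known. In that special case the estimating equation reduces to a weighted mean with weights proportional to $\exp\{-\beta (x_i - \mu)^2 / (2\sigma^2)\}$, which can be solved by fixed-point iteration. The function name `dpd_location`, the choice $\beta = 0.5$, the known `sigma`, and the simulated contamination are all assumptions made for illustration; this is not the constrained mixture algorithm proposed by the authors.

```python
import numpy as np

def dpd_location(x, sigma=1.0, beta=0.5, tol=1e-10, max_iter=200):
    """Illustrative reweighted-mean iteration for the mean of N(mu, sigma^2).

    sigma is assumed known; beta > 0 controls how strongly points far from
    the current estimate are downweighted (hypothetical helper, not the
    paper's mixture estimator).
    """
    x = np.asarray(x, dtype=float)
    mu = np.median(x)  # robust starting value
    for _ in range(max_iter):
        # Weight of each point: normal density kernel raised to the power beta.
        w = np.exp(-beta * (x - mu) ** 2 / (2.0 * sigma ** 2))
        mu_new = np.sum(w * x) / np.sum(w)  # weighted least squares update
        if abs(mu_new - mu) < tol:
            return mu_new, w
        mu = mu_new
    return mu, w

# Simulated data: 200 clean N(0, 1) points plus 20 gross outliers at 10.
rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=200)
outliers = np.full(20, 10.0)
x = np.concatenate([clean, outliers])

mu_hat, weights = dpd_location(x)
sample_mean = x.mean()
print(f"sample mean: {sample_mean:.3f}, reweighted estimate: {mu_hat:.3f}")
```

The final weights also hint at the exploratory anomaly detection idea: observations whose weight is close to zero under the fitted model are natural candidates for being flagged as outlying.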
