In a data matrix, we may distinguish between cases, each a row vector representing one statistical unit, and cells, which are the single entries of the matrix. Recent developments in robust statistics have introduced the cellwise contamination paradigm, which assumes that contamination affects individual cells rather than entire cases. This paradigm becomes particularly relevant as the number of variables increases. Traditional (casewise) robust methods discard or downweight an entire case because of a few anomalous cells in it, which can result in substantial information loss, since the non-contaminated (or reliable) cells can still be highly informative. The same philosophy can be carried over to fuzzy clustering, by assuming that the reliable cells within a case may still provide useful information for determining fuzzy memberships. This work therefore introduces a robust fuzzy clustering proposal that handles outlying cells while simultaneously controlling the degree of fuzziness of the unit assignments. The cluster-specific relationships among variables, detected by the fuzzy clustering approach, are also key to better identifying outlying cells and correcting them. The strengths of the proposed methodology are illustrated through a simulation study and two real-world applications. The effects of the model's tuning parameters are explored, and guidance is provided for users on how to set them suitably.
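The claim that cellwise contamination matters more as the number of variables grows can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that each cell is contaminated independently with the same probability `eps`; this independence model and the function name `contaminated_case_fraction` are assumptions made here, not details given in the abstract. Under that assumption, the probability that a case with `p` cells contains at least one contaminated cell is 1 - (1 - eps)^p.

```python
def contaminated_case_fraction(eps: float, p: int) -> float:
    """Probability that a case with p cells has at least one contaminated cell.

    Illustrative assumption (not from the abstract): every cell is
    contaminated independently with the same probability eps.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be a positive integer")
    # Complement of the event "all p cells are clean".
    return 1.0 - (1.0 - eps) ** p


if __name__ == "__main__":
    # With 5% contaminated cells, the share of affected cases grows with p.
    for p in (5, 20, 100):
        print(p, round(contaminated_case_fraction(0.05, p), 3))
```

Under this simplified model, even a small cell-level contamination rate leaves few fully clean cases once there are many variables, which is why a casewise method that downweights whole cases may discard many reliable cells.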