Statistical Inference for Fuzzy Clustering

Clustering is a central tool in biomedical research for discovering heterogeneous patient subpopulations, where group boundaries are often diffuse rather than sharply separated. Traditional methods produce hard partitions, whereas soft clustering methods such as fuzzy $c$-means (FCM) allow mixed memberships and better capture uncertainty and gradual transitions. Despite the widespread use of FCM, principled statistical inference for fuzzy clustering remains limited. We develop a new framework for weighted fuzzy $c$-means (WFCM) for settings with potential cluster size imbalance. Cluster-specific weights rebalance the classical FCM criterion so that smaller clusters are not overwhelmed by dominant groups, and the weighted objective induces a normalized density model with scale parameter $\sigma$ and fuzziness parameter $m$. Estimation is performed via a blockwise majorize--minimize (MM) procedure that alternates closed-form membership and centroid updates with likelihood-based updates of $(\sigma, \mathbf{w})$. The intractable normalizing constant is approximated by importance sampling using a data-adaptive Gaussian mixture proposal. We further provide likelihood ratio tests for comparing cluster centers and bootstrap-based confidence intervals. We establish consistency and asymptotic normality of the maximum likelihood estimator, validate the method through simulations, and illustrate it using single-cell RNA-seq and Alzheimer's Disease Neuroimaging Initiative (ADNI) data. These applications demonstrate stable uncertainty quantification and biologically meaningful soft memberships, ranging from well-separated cell populations under imbalance to a graded AD versus non-AD continuum consistent with disease progression.
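
For orientation, the closed-form membership and centroid updates mentioned above can be illustrated with the classical, unweighted FCM baseline. The NumPy sketch below is only that baseline: it omits the cluster-specific weights, the scale parameter, and the likelihood-based updates of WFCM, whose exact forms are not given in this abstract. The function name `fcm` and its arguments are hypothetical, chosen for illustration.

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=200, tol=1e-8, seed=0):
    """Classical (unweighted) fuzzy c-means, NOT the weighted WFCM variant.

    Returns memberships U with shape (n, c) and centers V with shape (c, d).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships; each row sums to 1.
    U = rng.dirichlet(np.ones(c), size=n)
    for _ in range(n_iter):
        Um = U ** m
        # Centroid update: means weighted by memberships raised to the power m.
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared Euclidean distances from every point to every center, shape (n, c).
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)), normalized per point.
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V
```

Each row of `U` is a soft membership vector that sums to one; a hard partition, if one is needed, can be read off with `U.argmax(axis=1)`.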
