The Gaussian mixture model is widely used in unsupervised learning, owing to its simplicity and interpretability. However, a fundamental limitation of the classical Gaussian mixture model is that it forces each observation to belong to exactly one component. In many practical applications, such as genetics, social network analysis, and text mining, an observation may naturally belong to multiple components or exhibit partial membership in several latent components. To overcome this limitation, we propose the mixed membership sub-Gaussian model, which extends the classical Gaussian mixture framework by allowing each observation to belong to multiple components. This model inherits the interpretability of the classical Gaussian mixture model while offering greater flexibility for capturing complex overlapping structures. We develop an efficient spectral algorithm to estimate the mixed membership of each individual observation, and under mild separation conditions on the component centres, we prove that the estimation error of the per-individual membership vector can be made arbitrarily small with high probability. To our knowledge, this is the first work to provide a computationally efficient estimator with such a vanishing-error guarantee for a mixed-membership extension of the Gaussian mixture model. Extensive experimental studies demonstrate that our method outperforms existing approaches that ignore mixed memberships.
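To make the contrast between hard and mixed membership concrete, the following is a minimal sketch of one plausible generative reading of the description above. It is an illustrative assumption, not the model specification or the spectral estimator from this work. It assumes each observation is a membership-weighted combination of component centres plus Gaussian noise (one instance of sub-Gaussian noise). The function names, the centres, and the noise level are hypothetical.

```python
import random


def sample_membership(k, rng, pure=False):
    """Return a membership vector on the probability simplex (illustrative)."""
    if pure:
        # Classical mixture: all weight sits on exactly one component.
        vec = [0.0] * k
        vec[rng.randrange(k)] = 1.0
        return vec
    # Mixed membership: normalised Exp(1) draws, i.e. a flat Dirichlet draw.
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [r / total for r in raw]


def sample_observation(membership, centres, noise_sd, rng):
    """Assumed model: x = sum_k membership[k] * centres[k] + Gaussian noise."""
    dim = len(centres[0])
    mean = [
        sum(w * c[j] for w, c in zip(membership, centres))
        for j in range(dim)
    ]
    return [m + rng.gauss(0.0, noise_sd) for m in mean]


rng = random.Random(0)
# Hypothetical, well-separated component centres in two dimensions.
centres = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]

pi_mixed = sample_membership(len(centres), rng)            # partial membership
pi_pure = sample_membership(len(centres), rng, pure=True)  # classical case
x_mixed = sample_observation(pi_mixed, centres, noise_sd=0.1, rng=rng)
x_pure = sample_observation(pi_pure, centres, noise_sd=0.1, rng=rng)
```

Under this assumed reading, the classical mixture corresponds to the special case in which every membership vector is one-hot, so pure observations concentrate around a single centre while mixed observations fall inside the convex hull of the centres.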