Mixed membership models extend finite mixture models by allowing each observation to belong partially to more than one mixture component. A probabilistic framework for mixed membership models of high-dimensional continuous data is proposed, with a focus on scalability and interpretability. The novel probabilistic representation of mixed membership is based on convex combinations of dependent multivariate Gaussian random vectors. In this setting, scalability is ensured by approximating a tensor covariance structure with multivariate eigen-approximations, with adaptive regularization imposed through shrinkage priors. Conditional weak posterior consistency is established on an unconstrained model, which allows for a simple posterior sampling scheme while retaining many of the desired theoretical properties of the model. The model is motivated by two biomedical case studies: one on functional brain imaging of children with autism spectrum disorder (ASD) and one on gene expression data from breast cancer tissue. These applications highlight how the typical assumption made in cluster analysis, that each observation comes from one homogeneous subgroup, can be restrictive and lead to unnatural interpretations of data features.
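As a rough illustration of the convex-combination representation described above, the following NumPy sketch simulates observations whose values blend several Gaussian component vectors. It is not the authors' implementation. The function name `simulate_mixed_membership`, the Dirichlet draw for the membership weights, and the independence of the component vectors are all assumptions made for illustration, whereas the proposed model allows the component vectors to be dependent.

```python
import numpy as np


def simulate_mixed_membership(n, means, covs, alpha, rng):
    """Draw n observations as convex combinations of Gaussian vectors.

    means: array of shape (K, p), one mean vector per component.
    covs:  array of shape (K, p, p), one covariance matrix per component.
    alpha: Dirichlet concentration of length K (an illustrative choice).
    """
    K, p = means.shape
    # Membership weights: each row is nonnegative and sums to 1.
    Z = rng.dirichlet(alpha, size=n)  # shape (n, K)
    # One Gaussian vector per component and observation.
    # Assumption: components are drawn independently here for simplicity.
    F = np.stack(
        [rng.multivariate_normal(means[k], covs[k], size=n) for k in range(K)],
        axis=1,
    )  # shape (n, K, p)
    # Convex combination of the K component vectors for each observation.
    Y = np.einsum("nk,nkp->np", Z, F)  # shape (n, p)
    return Y, Z


rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0, 0.0], [3.0, 3.0, 3.0]])
covs = np.stack([np.eye(3), 0.5 * np.eye(3)])
Y, Z = simulate_mixed_membership(200, means, covs, alpha=np.ones(2), rng=rng)
```

If every row of `Z` were restricted to a one-hot vector, each observation would come from exactly one component, which recovers the finite mixture (cluster analysis) setting that mixed membership relaxes.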