Mixture models are often used to identify meaningful subpopulations (i.e., clusters) in observed data such that the subpopulations have a real-world interpretation (e.g., as cell types). However, when used for subpopulation discovery, mixture model inference is usually ill-defined a priori because the assumed observation model is only an approximation to the true data-generating process. As a result, inferences do not improve as the number of observations increases; instead, the data are explained by adding spurious subpopulations that compensate for the shortcomings of the observation model. Nevertheless, there are two important sources of prior knowledge that we can exploit to obtain well-defined results regardless of dataset size: known causal structure (e.g., knowing that the latent subpopulations cause the observed signal but not vice versa) and a rough sense of how wrong the observation model is (e.g., based on small amounts of expert-labeled data or some understanding of the data-generating process). We propose a new model selection criterion that, while model-based, uses this available knowledge to obtain mixture model inferences that are robust to misspecification of the observation model. We provide theoretical support for our approach by proving a first-of-its-kind consistency result under intuitive assumptions. Simulation studies and an application to flow cytometry data demonstrate that our model selection criterion consistently finds the correct number of subpopulations.