Structurally Aware Robust Model Selection for Mixtures

Mixture models are often used to identify meaningful subpopulations (i.e., clusters) in observed data such that the subpopulations have a real-world interpretation (e.g., as cell types). However, when used for subpopulation discovery, mixture model inference is usually ill-defined a priori because the assumed observation model is only an approximation to the true data-generating process. Thus, as the number of observations increases, the inferences get worse rather than better: the data are explained by adding spurious subpopulations that compensate for the shortcomings of the observation model. However, there are two important sources of prior knowledge that we can exploit to obtain well-defined results no matter the dataset size: known causal structure (e.g., knowing that the latent subpopulations cause the observed signal but not vice versa) and a rough sense of how wrong the observation model is (e.g., based on small amounts of expert-labeled data or some understanding of the data-generating process). We propose a new model selection criterion that, while model-based, uses this available knowledge to obtain mixture model inferences that are robust to misspecification of the observation model. We provide theoretical support for our approach by proving a first-of-its-kind consistency result under intuitive assumptions. Simulation studies and an application to flow cytometry data demonstrate that our model selection criterion consistently finds the correct number of subpopulations.
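The failure mode described above can be illustrated with a small simulation. The sketch below is not the criterion proposed in the paper. It assumes a standard BIC-based selection with a one-dimensional Gaussian mixture fitted by plain EM, applied to data drawn from two subpopulations with Laplace noise, so the Gaussian observation model is misspecified. All function names, the choice of Laplace noise, the centers at -3 and 3, and the sample sizes are illustrative assumptions.

```python
import math
import random


def sample_laplace_mixture(n, rng):
    """Draw n points from two true subpopulations with Laplace noise."""
    data = []
    for _ in range(n):
        center = -3.0 if rng.random() < 0.5 else 3.0
        # The difference of two unit exponentials is Laplace-distributed.
        data.append(center + rng.expovariate(1.0) - rng.expovariate(1.0))
    return data


def log_normal_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def fit_gmm_1d(data, k, rng, n_iter=30):
    """Plain EM for a 1D Gaussian mixture.

    Returns the log-likelihood evaluated in the last E-step.
    """
    n = len(data)
    means = rng.sample(data, k)  # initialize means at random data points
    grand_mean = sum(data) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in data) / n, 1e-3)
    variances = [grand_var] * k
    weights = [1.0 / k] * k
    loglik = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibilities via a numerically stable log-sum-exp.
        resp = []
        loglik = 0.0
        for x in data:
            logs = [math.log(weights[j]) + log_normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            top = max(logs)
            total = sum(math.exp(v - top) for v in logs)
            loglik += top + math.log(total)
            resp.append([math.exp(v - top) / total for v in logs])
        # M-step: weighted updates, with floors to avoid degenerate components.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-10
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-3)
    return loglik


def select_k_by_bic(data, max_k, rng):
    """Return the number of components in 1..max_k with the lowest BIC."""
    best_k, best_bic = 1, float("inf")
    for k in range(1, max_k + 1):
        loglik = fit_gmm_1d(data, k, rng)
        n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
        bic = -2.0 * loglik + n_params * math.log(len(data))
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (100, 1000, 5000):
        data = sample_laplace_mixture(n, rng)
        print(n, select_k_by_bic(data, max_k=5, rng=rng))
```

Under these assumptions one would expect the selected number of components to drift above the true value of two as n grows, because extra Gaussian components help approximate the heavier Laplace tails. The exact numbers depend on the seed and on the EM initialization, so the script is a qualitative illustration rather than a benchmark.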

