Heterogeneous, mixed-type datasets that include both continuous and categorical variables are ubiquitous, and they enrich data analysis by allowing more complex relationships and interactions to be modelled. Mixture models offer a flexible framework for capturing the underlying heterogeneity and relationships in mixed-type datasets. Most current approaches for modelling mixed data either forgo uncertainty quantification and conduct only point estimation, or rely on Markov chain Monte Carlo (MCMC), whose very high computational cost does not scale to large datasets. This paper develops a coordinate ascent variational inference (CAVI) algorithm for mixture models on mixed (continuous and categorical) data, which circumvents the high computational cost of MCMC while retaining uncertainty quantification. We demonstrate our approach through simulation studies as well as an applied case study of the NHANES risk factor dataset. We provide theoretical justification for our method by establishing that the CAVI variational posterior mean converges locally to the true parameter value, with a gap of $O(1/n)$ from the maximum likelihood estimator. Building on this result, we show that the CAVI variational posterior contracts around the true parameter at rate $O(n^{-1/2})$.