High-dimensional categorical data are routinely collected in the biomedical and social sciences. It is important to build interpretable, parsimonious models that perform dimension reduction and uncover meaningful latent structures in such discrete data. Identifiability is a fundamental requirement for valid modeling and inference in these settings, yet it is challenging to establish when the latent structures are complex. In this article, we propose a class of identifiable multilayer (potentially deep) discrete latent structure models for discrete data, termed Bayesian pyramids. We establish the identifiability of Bayesian pyramids by developing novel, transparent conditions on the pyramid-shaped deep latent directed graph. The proposed identifiability conditions can ensure Bayesian posterior consistency under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. Simulation results for this model corroborate the identifiability and estimability of the model parameters. Applications of the methodology to DNA nucleotide sequence data uncover useful discrete latent features that are highly predictive of sequence types. The proposed framework provides a recipe for interpretable unsupervised learning of discrete data and can be a useful alternative to popular machine learning methods.
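To make the two-latent-layer structure concrete, the following is a minimal simulation sketch of one possible pyramid-shaped generative model. The abstract does not give the parameterization, so everything specific here is an assumption made for illustration: a single categorical top-layer variable `z`, binary middle-layer variables `alpha` that are conditionally independent given `z`, a 0/1 matrix `graph` encoding which middle-layer variables are parents of each observed item, and a softmax link for the observed categories. The function name, argument names, and all sizes are hypothetical, and the sketch covers data generation only, not the Bayesian shrinkage estimator.

```python
import numpy as np


def sample_two_layer_pyramid(n, graph, top_probs, mid_probs, bottom_logits, rng):
    """Draw n observations from a toy two-latent-layer discrete model.

    All shapes and the softmax link are illustrative assumptions:
      graph:         (p, K) 0/1 matrix; graph[j, k] = 1 if latent k is a parent of item j
      top_probs:     (B,) distribution of the single top-layer class z
      mid_probs:     (B, K) P(alpha_k = 1 | z = b) for the binary middle layer
      bottom_logits: (p, K + 1, C) intercept plus one effect per latent, per category
    """
    p, K = graph.shape
    C = bottom_logits.shape[2]
    # Top layer: one categorical latent class per observation.
    z = rng.choice(len(top_probs), size=n, p=top_probs)
    # Middle layer: K binary latents, conditionally independent given z.
    alpha = (rng.random((n, K)) < mid_probs[z]).astype(int)
    # Bottom layer: each item depends only on its parents in the graph.
    design = np.concatenate([np.ones((n, 1)), alpha], axis=1)  # (n, K + 1)
    mask = np.concatenate([np.ones((p, 1)), graph], axis=1)    # (p, K + 1)
    logits = np.einsum("nk,pk,pkc->npc", design, mask, bottom_logits)
    probs = np.exp(logits - logits.max(axis=2, keepdims=True))
    probs /= probs.sum(axis=2, keepdims=True)
    # Inverse-CDF sampling of one category per (observation, item);
    # the clip guards against round-off in the cumulative sum.
    u = rng.random((n, p, 1))
    y = np.minimum((u > probs.cumsum(axis=2)).sum(axis=2), C - 1)
    return z, alpha, y


rng = np.random.default_rng(0)
p, K, B, C = 6, 2, 2, 4  # C = 4 mirrors the four nucleotides; other sizes are arbitrary
graph = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])
top_probs = np.array([0.5, 0.5])
mid_probs = np.array([[0.9, 0.1], [0.2, 0.8]])
bottom_logits = rng.normal(size=(p, K + 1, C))
z, alpha, y = sample_two_layer_pyramid(500, graph, top_probs, mid_probs, bottom_logits, rng)
```

In this toy graph each middle-layer variable has three observed children of its own. That choice is only meant to evoke the pyramid shape, with fewer variables in each higher layer, and should not be read as the identifiability condition established in the article.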