Discrete data are abundant and often arise as counts or rounded data. These data commonly exhibit complex distributional features such as zero-inflation, over- or underdispersion, boundedness, and heaping, which render many parametric models inadequate. Yet even for parametric regression models, approximate algorithms such as Markov chain Monte Carlo (MCMC) are typically needed for posterior inference. This paper introduces a Bayesian modeling and algorithmic framework that enables semiparametric regression analysis for discrete data using direct Monte Carlo (not MCMC) sampling. The proposed approach pairs a nonparametric marginal model with a latent linear regression model to encourage both flexibility and interpretability, and it delivers posterior consistency even under model misspecification. For a parametric or large-sample approximation of this model, we identify a class of conjugate priors with (pseudo) closed-form posteriors. All posterior and predictive distributions are available analytically or via direct Monte Carlo sampling. These tools are broadly useful for linear regression, nonlinear models via basis expansions, and variable selection with discrete data. Simulation studies demonstrate significant advantages in computing, prediction, estimation, and selection relative to existing alternatives. The approach is applied successfully to self-reported mental health data that exhibit zero-inflation, overdispersion, boundedness, and heaping.
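To make the data features named in the abstract concrete, the following is a minimal generative sketch and not the paper's algorithm. It assumes a latent Gaussian linear regression whose output is mapped to bounded counts by a hypothetical exponential transformation followed by flooring and truncation at an upper bound. The function name `simulate_rounded_counts`, the coefficients, the transformation, and the bound `y_max` are all illustrative assumptions that do not come from the source.

```python
import numpy as np


def simulate_rounded_counts(n=1000, y_max=30, seed=0):
    """Simulate bounded counts by rounding a transformed latent regression.

    Illustrative only: the exp/floor mapping and all parameter values
    are assumptions, not the model specification from the paper.
    """
    rng = np.random.default_rng(seed)

    # Design matrix with an intercept and one standard normal covariate.
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])

    # Hypothetical regression coefficients for the latent scale.
    beta = np.array([1.0, 0.8])

    # Latent Gaussian linear regression.
    z = X @ beta + rng.normal(scale=1.0, size=n)

    # Map to the count scale: exp keeps values positive, floor rounds
    # down (so exp(z) < 1 becomes 0), and clip enforces the upper bound.
    y = np.clip(np.floor(np.exp(z)), 0, y_max).astype(int)
    return X, y


if __name__ == "__main__":
    X, y = simulate_rounded_counts()
    print("share of zeros:", (y == 0).mean())
    print("share at upper bound:", (y == y_max).mean() if (y_max := 30) else None)
```

Under these assumptions, every latent value below zero is rounded to a count of zero and every large latent value is truncated at `y_max`, so the simulated counts pile up at both ends of the support. This is the kind of zero mass and boundedness that a standard parametric count model would struggle to reproduce.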