High-dimensional data are routinely collected in many areas. We are particularly interested in Bayesian classification models in which one or more variables are imbalanced. Current Markov chain Monte Carlo algorithms for posterior computation become inefficient as the sample size $n$ and/or the number of variables $p$ increase, because both the time per step and the mixing rate worsen. One strategy is to use a gradient-based sampler to improve mixing while using data sub-samples to reduce per-step computational complexity. However, the usual sub-sampling breaks down when applied to imbalanced data. Instead, we generalize piecewise deterministic Markov chain Monte Carlo algorithms to include importance-weighted and mini-batch sub-sampling. These approaches maintain the correct stationary distribution with arbitrarily small sub-samples, and substantially outperform current competitors. We provide theoretical support and illustrate gains in simulated and real data applications.
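To make the sub-sampling idea concrete, the following is a minimal sketch of one ingredient only: an importance-weighted, sub-sampled estimate of the log-likelihood gradient for logistic regression. It is not the piecewise deterministic sampler itself. The function names, the toy data generator, and the choice of sampling weights proportional to row norms are all assumptions made for illustration, not the authors' construction.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def full_gradient(beta, X, y):
    """Gradient of the logistic negative log-likelihood, summed over all n rows."""
    return X.T @ (sigmoid(X @ beta) - y)


def row_norm_probs(X):
    """Hypothetical sampling weights: proportional to each row's Euclidean norm."""
    norms = np.linalg.norm(X, axis=1)
    return norms / norms.sum()


def weighted_subsample_gradient(beta, X, y, probs, m, rng):
    """Estimate full_gradient from m rows drawn with replacement using probs."""
    idx = rng.choice(len(y), size=m, replace=True, p=probs)
    resid = sigmoid(X[idx] @ beta) - y[idx]
    # Dividing each sampled term by m * probs[i] makes the estimator unbiased.
    scale = resid / (m * probs[idx])
    return X[idx].T @ scale


def simulate_imbalanced(n, rng):
    """Toy data (assumed): intercept, a rare binary covariate, a Gaussian covariate."""
    rare = (rng.random(n) < 0.02).astype(float)
    X = np.column_stack([np.ones(n), rare, rng.standard_normal(n)])
    true_beta = np.array([-3.0, 2.0, 0.5])
    y = (rng.random(n) < sigmoid(X @ true_beta)).astype(float)
    return X, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = simulate_imbalanced(2000, rng)
    beta = np.zeros(3)
    probs = row_norm_probs(X)
    draws = np.array(
        [weighted_subsample_gradient(beta, X, y, probs, 20, rng) for _ in range(2000)]
    )
    print("full gradient:    ", full_gradient(beta, X, y))
    print("mean of estimates:", draws.mean(axis=0))
```

Each estimate touches only 20 of the 2000 rows, yet its expectation equals the full-data gradient because every sampled term is divided by its sampling probability. Non-uniform weights matter for imbalanced data because uniform sub-samples rarely include the few informative rows, which inflates the variance of the estimate.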