Repro Samples Method for a Performance Guaranteed Inference in General and Irregular Inference Problems

Rapid advances in data science call for fundamentally new frameworks to tackle prevalent but highly non-trivial "irregular" inference problems, to which the large-sample central limit theorem does not apply. Typical examples include problems with discrete or non-numerical parameters and problems with non-numerical data. In this article, we present an innovative, wide-reaching, and effective approach, called the "repro samples method," for conducting statistical inference in these irregular problems and beyond. The development is related to, but improves on, several existing simulation-inspired inference approaches, and we provide both exact and approximate theories to support it. Moreover, the proposed approach is broadly applicable and subsumes the classical Neyman-Pearson framework as a special case. For the commonly encountered irregular inference problems that involve both discrete/non-numerical and continuous parameters, we propose an effective three-step procedure to make inferences for all parameters. We also develop a unique matching scheme that turns the discreteness of discrete/non-numerical parameters from an obstacle to forming inferential theories into a beneficial attribute for improving computational efficiency. We demonstrate the effectiveness of the proposed general methodology using various examples, including a case study on a Gaussian mixture model with an unknown number of components. This case study provides a solution to a long-standing open inference question in statistics: how to quantify the estimation uncertainty for the unknown number of components and other associated parameters. Real data and simulation studies, with comparisons to existing approaches, demonstrate the far superior performance of the proposed method.

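The abstract does not spell out the algorithm, so the following is only a minimal Monte Carlo sketch of the general idea of simulation-based inference for a discrete parameter, not the authors' procedure. It assumes a toy model Y_i = theta + U_i with U_i drawn from a standard normal distribution and theta restricted to a finite set of integers. The function name `repro_confidence_set`, the choice of statistic (the absolute deviation of the sample mean from theta), and all parameter names are hypothetical. For each candidate theta, artificial noise copies are drawn from the known noise distribution, and the candidate is kept if the observed statistic falls within the simulated 1 - alpha range.

```python
import numpy as np


def repro_confidence_set(y_obs, candidates, alpha=0.05, n_sim=2000, seed=0):
    """Toy confidence set for an integer location parameter (illustrative only).

    Assumed model (not taken from the paper): Y_i = theta + U_i, U_i ~ N(0, 1).
    """
    rng = np.random.default_rng(seed)
    y_obs = np.asarray(y_obs, dtype=float)
    n = len(y_obs)
    accepted = []
    for theta in candidates:
        # Draw artificial noise copies u* from the known noise distribution.
        u_star = rng.standard_normal((n_sim, n))
        # Regenerate artificial data sets under this candidate theta.
        y_star = theta + u_star
        # Simulated distribution of the statistic |mean(y*) - theta|.
        t_sim = np.abs(y_star.mean(axis=1) - theta)
        cutoff = np.quantile(t_sim, 1 - alpha)
        # Keep theta if the observed statistic is not extreme under theta.
        t_obs = abs(y_obs.mean() - theta)
        if t_obs <= cutoff:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    y_obs = 3 + rng.standard_normal(25)  # synthetic data, true theta = 3
    print(repro_confidence_set(y_obs, range(0, 7)))
```

In this simple setting the sketch amounts to inverting a Monte Carlo test over the candidate values, which is in line with the abstract's remark that the classical Neyman-Pearson framework arises as a special case. The three-step procedure and the matching scheme mentioned in the abstract are not represented here.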