Overcoming Selection Bias in Statistical Studies With Amortized Bayesian Inference

Selection bias arises when the probability that an observation enters a dataset depends on variables related to the quantities of interest, leading to systematic distortions in estimation and uncertainty quantification. For example, in epidemiological or survey settings, individuals with certain outcomes may be more likely to be included, resulting in biased prevalence estimates with potentially substantial downstream impact. Classical corrections, such as inverse-probability weighting or explicit likelihood-based models of the selection process, rely on tractable likelihoods, which limits their applicability in complex stochastic models with latent dynamics or high-dimensional structure. Simulation-based inference enables Bayesian analysis without tractable likelihoods but typically assumes missingness at random and thus fails when selection depends on unobserved outcomes or covariates. Here, we develop a bias-aware simulation-based inference framework that explicitly incorporates selection into neural posterior estimation. By embedding the selection mechanism directly into the generative simulator, the approach enables amortized Bayesian inference without requiring tractable likelihoods. This recasting of selection bias as part of the simulation process allows us to both obtain debiased estimates and explicitly test for the presence of bias. The framework integrates diagnostics to detect discrepancies between simulated and observed data and to assess posterior calibration. The method recovers well-calibrated posterior distributions across three statistical applications with diverse selection mechanisms, including settings in which likelihood-based approaches yield biased estimates. These results recast the correction of selection bias as a simulation problem and establish simulation-based inference as a practical and testable strategy for parameter estimation under selection bias.
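
The idea of embedding the selection mechanism in the generative simulator can be illustrated with a minimal sketch. It assumes a toy prevalence model in which positive outcomes are more likely to be included than negative ones. The function names, the inclusion probabilities (0.9 and 0.3), the uniform prior, and the tolerance are hypothetical choices for illustration and are not taken from the paper. For brevity, plain rejection sampling (approximate Bayesian computation) stands in for neural posterior estimation; the point carried over is only that simulated datasets pass through the same selection step as the observed data.

```python
import numpy as np


def simulate_selected(theta, n_obs, rng, p_keep=(0.3, 0.9)):
    """Simulate binary outcomes with prevalence `theta`, then apply
    outcome-dependent selection. p_keep = (P(keep | y=0), P(keep | y=1)).
    Returns exactly `n_obs` selected outcomes as floats."""
    kept = []
    while len(kept) < n_obs:
        y = rng.random(4 * n_obs) < theta              # latent outcomes
        keep_prob = np.where(y, p_keep[1], p_keep[0])  # selection depends on y
        selected = rng.random(y.size) < keep_prob
        kept.extend(y[selected].tolist())
    return np.array(kept[:n_obs], dtype=float)


def abc_posterior(observed_mean, n_obs, rng, n_draws=5000, tol=0.02):
    """Rejection-sampling stand-in for posterior estimation: draw theta from
    a Uniform(0, 1) prior, simulate WITH selection, and keep the draws whose
    simulated sample mean lies within `tol` of the observed mean."""
    thetas = rng.random(n_draws)
    sim_means = np.array(
        [simulate_selected(t, n_obs, rng).mean() for t in thetas]
    )
    return thetas[np.abs(sim_means - observed_mean) < tol]


rng = np.random.default_rng(0)
true_theta = 0.2

# "Observed" data: generated by the same biased process we want to correct.
observed = simulate_selected(true_theta, 200, rng)
naive_estimate = observed.mean()  # ignores selection, so it is inflated

posterior = abc_posterior(naive_estimate, 200, rng)
print(f"naive estimate:       {naive_estimate:.3f}")
print(f"bias-aware posterior: {posterior.mean():.3f} +/- {posterior.std():.3f}")
```

Under these assumptions, the expected prevalence among selected observations is 0.9θ / (0.9θ + 0.3(1 − θ)), which is about 0.43 for θ = 0.2, so the naive sample mean overstates the true prevalence, while the posterior obtained from the selection-aware simulator should concentrate near the true value.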
