This paper introduces a novel generator, Perturbation-Assisted Sample Synthesis (PASS), designed for drawing reliable conclusions from complex data, especially when advanced modeling techniques such as deep neural networks are used. PASS uses perturbation to generate synthetic data that closely mirrors the distribution of the raw data, covering both numerical data, such as gene expression, and unstructured data, such as images and text. By estimating the data-generating distribution and leveraging large pre-trained generative models, PASS improves estimation accuracy and provides an estimated distribution of any statistic through Monte Carlo experiments. Building on PASS, we propose a generative inference framework, Perturbation-Assisted Inference (PAI), which offers a statistical guarantee of validity. In pivotal inference, PAI yields accurate conclusions even with limited data and without knowledge of the pivotal's distribution, much as a simulation study would. In non-pivotal situations, we train PASS on an independent holdout sample, which yields credible conclusions. To showcase PAI's ability to tackle complex problems, we highlight its applications in three domains: image synthesis inference, sentiment word inference, and multimodal inference via Stable Diffusion.
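The Monte Carlo step mentioned in the abstract, estimating the distribution of a statistic by repeatedly drawing synthetic samples from a generator, can be illustrated with a minimal sketch. This is not the PASS generator itself: `toy_generator` is a hypothetical stand-in that draws standard normal data under a null hypothesis, and the function names, the one-sample t-statistic, and the sample sizes are all assumptions chosen for illustration. The t-statistic is used because it is pivotal under normality, so its null distribution can be simulated without a closed form.

```python
import numpy as np


def t_statistic(x, mu0=0.0):
    """One-sample t-statistic for the null hypothesis mean == mu0."""
    x = np.asarray(x, dtype=float)
    return (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(x.size))


def toy_generator(n, rng):
    """Hypothetical stand-in for a trained generator.

    Draws n synthetic observations under the null (standard normal).
    A real generator would be estimated from data; this one is not.
    """
    return rng.standard_normal(n)


def monte_carlo_null(statistic, generator, n, num_draws, rng):
    """Evaluate the statistic on num_draws synthetic samples of size n."""
    return np.array([statistic(generator(n, rng)) for _ in range(num_draws)])


def monte_carlo_p_value(observed, null_draws):
    """Two-sided Monte Carlo p-value with the usual +1 correction."""
    exceed = np.sum(np.abs(null_draws) >= abs(observed))
    return (exceed + 1) / (null_draws.size + 1)


rng = np.random.default_rng(0)
data = rng.normal(loc=0.8, scale=1.0, size=15)  # small illustrative sample

null_draws = monte_carlo_null(t_statistic, toy_generator, data.size, 2000, rng)
p_value = monte_carlo_p_value(t_statistic(data), null_draws)
print(f"Monte Carlo p-value: {p_value:.4f}")
```

The +1 correction in `monte_carlo_p_value` keeps the estimated p-value strictly positive, a common convention for simulation-based tests. In this sketch the generator is fixed in advance rather than learned, so it only illustrates the pivotal case.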