The rapid proliferation of high-quality synthetic data, whether generated by advanced AI models or collected as auxiliary data from related tasks, presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around any statistical inference procedure to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard inference method using only real data when the synthetic data is of low quality. The error of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and it decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction and the comparison of large reasoning models on complex math problems.