The rapid proliferation of high-quality synthetic data, whether generated by advanced AI models or collected as auxiliary data from related tasks, presents both opportunities and challenges for statistical inference. This paper introduces a GEneral Synthetic-Powered Inference (GESPI) framework that wraps around a broad class of statistical inference procedures to safely enhance sample efficiency by combining synthetic and real data. Our framework leverages high-quality synthetic data to boost statistical power, yet adaptively defaults to the standard method using only real data when the synthetic data are of low quality. The error rate of our method remains below a user-specified bound without any distributional assumptions on the synthetic data, and it decreases as the quality of the synthetic data improves. This flexibility enables seamless integration with conformal prediction, risk control, hypothesis testing, and multiple testing procedures, all without modifying the base inference method. We demonstrate the benefits of our method on challenging tasks with limited labeled data, including AlphaFold protein structure prediction and the comparison of large reasoning models on complex math problems.