Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but their theoretical understanding remains limited. We aim to strengthen this understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory predicts that multiple synthetic datasets are especially beneficial for high-variance downstream predictors, and yields a simple rule of thumb for selecting the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice by evaluating the performance of an ensemble over many synthetic datasets for several real datasets and downstream predictors. The results follow our theory, showing that our insights are practically relevant.
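The claim that ensembling over multiple synthetic datasets helps high-variance downstream predictors can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the experimental setup of the paper: the generator `sample_synthetic` (a linear-Gaussian model fitted to the real data), the 1-nearest-neighbour predictor `one_nn_predict`, and the helper `ensemble_mse` are all hypothetical names chosen for illustration, and no differential privacy is involved.

```python
import numpy as np


def one_nn_predict(train_x, train_y, query_x):
    # 1-nearest-neighbour regression: a deliberately high-variance predictor.
    idx = np.abs(query_x[:, None] - train_x[None, :]).argmin(axis=1)
    return train_y[idx]


def sample_synthetic(real_x, real_y, n, rng):
    # Hypothetical generator (an assumption for this sketch): fit a
    # linear-Gaussian model to the real data and draw n new points from it.
    slope, intercept = np.polyfit(real_x, real_y, 1)
    resid_std = np.std(real_y - (slope * real_x + intercept))
    x = rng.normal(real_x.mean(), real_x.std(), size=n)
    y = slope * x + intercept + rng.normal(0.0, resid_std, size=n)
    return x, y


def ensemble_mse(m, real_x, real_y, test_x, test_y, rng):
    # Train one predictor per synthetic dataset, average the m predictions,
    # and score the averaged prediction on held-out real data.
    preds = [
        one_nn_predict(*sample_synthetic(real_x, real_y, len(real_x), rng), test_x)
        for _ in range(m)
    ]
    return float(np.mean((np.mean(preds, axis=0) - test_y) ** 2))


rng = np.random.default_rng(0)
real_x = rng.normal(size=200)
real_y = 2.0 * real_x + rng.normal(size=200)
test_x = rng.normal(size=1000)
test_y = 2.0 * test_x + rng.normal(size=1000)

# Test MSE of the ensemble as the number of synthetic datasets m grows.
mse_by_m = {
    m: ensemble_mse(m, real_x, real_y, test_x, test_y, rng)
    for m in (1, 2, 4, 8, 16)
}
for m, mse in mse_by_m.items():
    print(f"m={m:2d}  test MSE={mse:.3f}")
```

In this toy setting the averaged predictor's error should fall as `m` increases, because averaging reduces the variance contributed by each synthetic dataset and the predictor fitted to it. The irreducible label noise and any error of the generator itself are not reduced by adding more synthetic datasets.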