Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but their theoretical understanding remains limited. We address this gap by deriving bias-variance decompositions for several settings that use multiple synthetic datasets. Our theory predicts that multiple synthetic datasets are especially beneficial for high-variance downstream predictors, and it yields a simple rule of thumb for selecting the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice by evaluating the performance of an ensemble over many synthetic datasets for several real datasets and downstream predictors. The results agree with our theory, indicating that its insights are also practically relevant.
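To make the ensemble setting concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy linear "real" dataset, a hypothetical linear-Gaussian generator (`fit_generator`, `sample_synthetic`), and 1-nearest-neighbour regression as a deliberately high-variance downstream predictor; all of these names and choices are illustrative assumptions. It trains one predictor per synthetic dataset and compares the average mean-squared error of the individual predictors with that of their averaged prediction.

```python
import random
import statistics


def make_real_data(n, rng):
    """Toy 'real' dataset: y = 2x + 1 + Gaussian noise (illustrative assumption)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


def fit_generator(xs, ys):
    """Hypothetical generator: linear-Gaussian model fitted by least squares."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    return slope, intercept, statistics.pstdev(residuals)


def sample_synthetic(gen, n, rng):
    """Draw one synthetic dataset of size n from the fitted generator."""
    slope, intercept, sigma = gen
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [slope * x + intercept + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def one_nn_predict(train, x):
    """1-nearest-neighbour regression, chosen here as a high-variance predictor."""
    xs, ys = train
    nearest = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
    return ys[nearest]


def mse(preds, targets):
    return statistics.fmean((p - t) ** 2 for p, t in zip(preds, targets))


def run(m=10, n=200, seed=0):
    """Return (average single-predictor MSE, ensemble MSE) on held-out real data."""
    rng = random.Random(seed)
    real_x, real_y = make_real_data(n, rng)
    test_x, test_y = make_real_data(n, rng)
    gen = fit_generator(real_x, real_y)
    synthetic = [sample_synthetic(gen, n, rng) for _ in range(m)]
    # One row of test predictions per synthetic dataset.
    per_model = [[one_nn_predict(d, x) for x in test_x] for d in synthetic]
    # Ensemble prediction: average over the m predictors at each test point.
    ensemble = [statistics.fmean(column) for column in zip(*per_model)]
    single_mse = statistics.fmean(mse(p, test_y) for p in per_model)
    return single_mse, mse(ensemble, test_y)


if __name__ == "__main__":
    single, ens = run()
    print(f"average single-predictor MSE: {single:.4f}")
    print(f"ensemble MSE over m datasets: {ens:.4f}")
```

Because squared error is convex, the ensemble MSE can never exceed the average MSE of the individual predictors; how large the gap is depends on the predictor's variance and on `m`, which is the quantity the rule of thumb mentioned above is meant to guide.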