A Bias-Variance Decomposition for Ensembles over Multiple Synthetic Datasets

Recent studies have highlighted the benefits of generating multiple synthetic datasets for supervised learning, from increased accuracy to more effective model selection and uncertainty estimation. These benefits have clear empirical support, but the theoretical understanding of them is currently very light. We seek to increase the theoretical understanding by deriving bias-variance decompositions for several settings of using multiple synthetic datasets, including differentially private synthetic data. Our theory predicts multiple synthetic datasets to be especially beneficial for high-variance downstream predictors, and yields a simple rule of thumb to select the appropriate number of synthetic datasets in the case of mean-squared error and Brier score. We investigate how our theory works in practice by evaluating the performance of an ensemble over many synthetic datasets for several real datasets and downstream predictors. The results follow our theory, showing that our insights are practically relevant.

翻译：近期研究强调了为监督学习生成多合成数据集的优势，包括提升准确率、优化模型选择及改进不确定性估计等方面。这些优势已获得明确的实证支持，但其理论理解目前仍较为薄弱。本文通过推导多合成数据集使用场景（包括差分隐私合成数据）的偏差-方差分解，致力于深化理论认知。我们的理论预测表明，多合成数据集对高方差下游预测器尤其有益，并针对均方误差和Brier评分场景提出了选择合适合成数据集数量的简易经验法则。通过在多组真实数据集和下游预测器上评估基于大量合成数据集的集成性能，我们验证了理论的实际适用性。实验结果与理论预测相符，证实了所提见解的实践相关性。

相关内容

数据集

关注 88

数据集，又称为资料集、数据集合或资料集合，是一种由数据所组成的集合。
Data set（或dataset）是一个数据的集合，通常以表格形式出现。每一列代表一个特定变量。每一行都对应于某一成员的数据集的问题。它列出的价值观为每一个变量，如身高和体重的一个物体或价值的随机数。每个数值被称为数据资料。对应于行数，该数据集的数据可能包括一个或多个成员。

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日