We provide results that exactly quantify how data augmentation affects the variance and limiting distribution of estimates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. It can act as a regularizer, but fails to do so in certain high-dimensional problems, and it may shift the double-descent peak of an empirical risk. Overall, the analysis shows that several properties attributed to data augmentation are neither universally true nor universally false, but depend on a combination of factors, notably the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension. Our main theoretical tool is a limit theorem for functions of randomly transformed, high-dimensional random vectors. The proof draws on work in probability on noise stability of functions of many variables.