Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations that yield samples that are diverse but inconsistent with the data distribution can be even more valuable than additional training data. Moreover, we find that data augmentations that encourage invariances can be more valuable than invariance alone, especially on small and medium-sized training sets. Following this observation, we show that augmentations induce additional stochasticity during training, effectively flattening the loss landscape.
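One way to read the "exchange rate" between augmented and additional real data is as a lookup on a reference curve: train without augmentation at several dataset sizes, then ask how many real samples that un-augmented baseline would need to match the accuracy an augmented model reaches at a fixed size. The sketch below illustrates that reading only and is not the procedure used in the paper. The function name `effective_extra_samples`, the interpolation in log dataset size, and all numbers are assumptions made for illustration.

```python
import numpy as np


def effective_extra_samples(sizes, baseline_acc, n, augmented_acc):
    """Estimate how many extra real samples an augmentation is 'worth'.

    sizes         : dataset sizes at which the un-augmented baseline was trained
    baseline_acc  : baseline accuracy at each size (must be strictly increasing)
    n             : dataset size used for the augmented run
    augmented_acc : accuracy of the augmented model trained on n samples

    Hypothetical helper: inverts the baseline curve by linear interpolation
    in log(size), which is an assumption, not a result from the paper.
    """
    sizes = np.asarray(sizes, dtype=float)
    baseline_acc = np.asarray(baseline_acc, dtype=float)

    # np.interp needs increasing x-coordinates; here x is accuracy.
    if np.any(np.diff(baseline_acc) <= 0):
        raise ValueError("baseline_acc must be strictly increasing")
    # Refuse to extrapolate beyond the measured baseline curve.
    if not (baseline_acc[0] <= augmented_acc <= baseline_acc[-1]):
        raise ValueError("augmented_acc lies outside the baseline curve")

    # Invert the curve: accuracy -> log(size) -> size.
    log_n_eff = np.interp(augmented_acc, baseline_acc, np.log(sizes))
    return float(np.exp(log_n_eff) - n)


# Toy, made-up numbers purely to exercise the function.
toy_sizes = [1000, 2000, 4000, 8000]
toy_baseline_acc = [0.50, 0.60, 0.70, 0.80]
extra = effective_extra_samples(toy_sizes, toy_baseline_acc, n=1000, augmented_acc=0.70)
print(f"augmentation is worth about {extra:.0f} extra real samples")
```

With these toy values the augmented model at 1000 samples matches the baseline at 4000 samples, so the augmentation is worth roughly 3000 additional real samples. In an out-of-distribution setting the same lookup would use accuracies measured on the shifted test set.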