It is impossible today to pretend that the practice of machine learning is compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how scenarios involving multiple data distributions are best served by representations that are both richer than those obtained by regularizing for the best in-distribution performance, and richer than those obtained under the influence of the implicit sparsity bias of common stochastic gradient procedures. This contribution investigates the use of very high dropout rates instead of ensembles to obtain such rich representations. Although training a deep network from scratch using such dropout rates is virtually impossible, fine-tuning a large pre-trained model under such conditions is not only possible but also achieves out-of-distribution performance that exceeds that of both ensembles and weight averaging methods such as model soups. This result has practical significance because the importance of the fine-tuning scenario has grown considerably in recent years. This result also provides interesting insights into the nature of rich representations and into the intrinsically linear nature of fine-tuning a large network using a comparatively small dataset.
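To make the notion of a very high dropout rate concrete, the following is a minimal sketch of standard inverted dropout applied to a feature vector. The abstract does not state the rate or the layer at which dropout is applied, so the rate of 0.9, the idea of applying it to the features that feed the final classifier, and the function name `high_dropout` are all illustrative assumptions rather than the authors' implementation.

```python
import random


def high_dropout(features, rate=0.9, rng=None, training=True):
    """Inverted dropout on a list of floats (hypothetical helper).

    Each feature is zeroed with probability `rate`; survivors are divided
    by the keep probability so the expected value is unchanged.
    The default rate of 0.9 is an assumption chosen for illustration.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    # At evaluation time (or with rate 0) dropout is the identity.
    if not training or rate == 0.0:
        return list(features)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]


if __name__ == "__main__":
    rng = random.Random(0)
    feats = [1.0] * 10_000
    dropped = high_dropout(feats, rate=0.9, rng=rng)
    zero_fraction = sum(1 for v in dropped if v == 0.0) / len(dropped)
    mean_value = sum(dropped) / len(dropped)
    print(f"fraction zeroed: {zero_fraction:.3f}")
    print(f"mean after rescaling: {mean_value:.3f}")
```

With a rate this high, only about one feature in ten survives each training step, so the classifier cannot rely on a small subset of features. At evaluation time (`training=False`) the features pass through unchanged.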