Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain, where the model will ultimately be used, is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to the datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as "combinations" of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and, notably, can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate that this is a promising new approach for targeted on-demand dataset synthesis.
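
To make the barycentric-projection scheme concrete, the following is a minimal sketch of interpolating along a generalized geodesic, under several simplifying assumptions that are ours rather than the paper's: the cost is the squared Euclidean distance on features only (labels are ignored, unlike the labeled-dataset distance mentioned above), the OT plans are approximated with entropic regularization (Sinkhorn iterations) rather than solved exactly, and the function names `sinkhorn_plan`, `barycentric_projection`, and `generalized_geodesic` are hypothetical.

```python
import math

def sinkhorn_plan(xs, ys, reg=0.05, n_iters=500):
    """Entropic OT plan between two uniformly weighted point clouds.

    Assumption: squared Euclidean cost on features, rescaled by its maximum
    so that `reg` is relative to the cost range.
    """
    n, m = len(xs), len(ys)
    cost = [[sum((a - b) ** 2 for a, b in zip(x, y)) for y in ys] for x in xs]
    scale = max(max(row) for row in cost) or 1.0
    kernel = [[math.exp(-c / (reg * scale)) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match uniform marginals.
        u = [(1.0 / n) / sum(k * vj for k, vj in zip(kernel[i], v)) for i in range(n)]
        v = [(1.0 / m) / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def barycentric_projection(plan, ys):
    """Map each base point to the plan-weighted average of its targets."""
    dim = len(ys[0])
    projected = []
    for row in plan:
        mass = sum(row)
        projected.append(
            tuple(sum(p * y[d] for p, y in zip(row, ys)) / mass for d in range(dim))
        )
    return projected

def generalized_geodesic(base, datasets, weights, reg=0.05):
    """Interpolate between `datasets`, anchored at the `base` point cloud.

    Each base point is sent to the weighted average of its barycentric
    projections onto every dataset; `weights` must sum to one.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    maps = [barycentric_projection(sinkhorn_plan(base, ys, reg), ys) for ys in datasets]
    dim = len(base[0])
    return [
        tuple(sum(w * proj[i][d] for w, proj in zip(weights, maps)) for d in range(dim))
        for i in range(len(base))
    ]

if __name__ == "__main__":
    # Toy data (illustrative only): two shifted copies of a unit square.
    base = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    ds_a = [(x + 2.0, y) for x, y in base]
    ds_b = [(x, y + 2.0) for x, y in base]
    for point in generalized_geodesic(base, [ds_a, ds_b], [0.5, 0.5]):
        print(point)
```

Varying the weight vector over the simplex traces out the family of synthetic datasets between the inputs; putting all the weight on one dataset recovers the barycentric projection onto that dataset alone. Because of the entropic smoothing, the projected points are pulled slightly toward the centre of each target cloud, which an exact OT solver or a learned transport map would avoid.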