Generating Synthetic Datasets by Interpolating along Generalized Geodesics

Data for pretraining machine learning models often consists of collections of heterogeneous datasets. Although training on their union is reasonable in agnostic settings, it might be suboptimal when the target domain -- where the model will ultimately be used -- is known in advance. In that case, one would ideally pretrain only on the dataset(s) most similar to the target one. Instead of limiting this choice to those datasets already present in the pretraining collection, here we explore extending this search to all datasets that can be synthesized as 'combinations' of them. We define such combinations as multi-dataset interpolations, formalized through the notion of generalized geodesics from optimal transport (OT) theory. We compute these geodesics using a recent notion of distance between labeled datasets, and derive alternative interpolation schemes based on it: using either barycentric projections or optimal transport maps, the latter computed using recent neural OT methods. These methods are scalable, efficient, and -- notably -- can be used to interpolate even between datasets with distinct and unrelated label sets. Through various experiments in transfer learning in computer vision, we demonstrate this is a promising new approach for targeted on-demand dataset synthesis.
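The barycentric-projection scheme mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation. It makes several simplifying assumptions: datasets are treated as unlabeled point clouds of feature vectors (the labeled dataset distance is not modeled), the transport plans come from a plain entropic Sinkhorn solver rather than exact or neural OT, and the function names `sinkhorn_plan`, `barycentric_projection`, and `generalized_geodesic` are hypothetical. The idea it shows is the standard one: transport a base point cloud to each dataset, then take a convex combination of the transported points.

```python
import numpy as np


def sinkhorn_plan(x, y, reg=0.05, n_iters=500):
    """Entropic OT plan between uniform point clouds x (n, d) and y (m, d).

    Assumption: squared Euclidean cost on raw features, labels ignored.
    """
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    cost = cost / max(cost.max(), 1e-12)  # rescale to [0, 1] for stability
    kernel = np.exp(-cost / reg)
    a = np.full(len(x), 1.0 / len(x))  # uniform source weights
    b = np.full(len(y), 1.0 / len(y))  # uniform target weights
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    return u[:, None] * kernel * v[None, :]


def barycentric_projection(plan, y):
    """Map each source point to the plan-weighted average of target points."""
    return (plan @ y) / plan.sum(axis=1, keepdims=True)


def generalized_geodesic(base, datasets, weights):
    """Interpolate between datasets through a shared base point cloud.

    Each dataset gets an approximate map T_i from the base; the result is
    sum_i weights[i] * T_i(base), a point on the generalized geodesic.
    """
    weights = np.asarray(weights, dtype=float)
    if (weights < 0).any() or not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    maps = [barycentric_projection(sinkhorn_plan(base, y), y) for y in datasets]
    return sum(w * m for w, m in zip(weights, maps))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    base = rng.normal(size=(20, 2))
    data_a = rng.normal(size=(30, 2)) + 3.0
    data_b = rng.normal(size=(25, 2)) - 3.0
    mixed = generalized_geodesic(base, [data_a, data_b], [0.7, 0.3])
    print(mixed.shape)
```

Setting one weight to 1 recovers the barycentric projection of the base onto that single dataset, while intermediate weights give synthetic points between the datasets. Because the entropic plan is blurred, the projected points are pulled toward the target mean; a smaller `reg` reduces this at the cost of slower convergence.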

