Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources

From arXiv. An extended abstract of this work appears in the Data-centric Machine Learning Research (DMLR) Workshop at the 40th International Conference on Machine Learning, Honolulu, HI, USA, July 29, 2023.

Traditionally, data selection has been studied in settings where all samples from prospective sources are fully revealed to a machine learning developer. However, in practical data exchange scenarios, data providers often reveal only a limited subset of samples before an acquisition decision is made. Recently, there have been efforts to fit scaling laws that predict model performance at any data size and source composition using the limited available samples. However, these scaling functions are black-box, computationally expensive to fit, highly susceptible to overfitting, and/or difficult to optimize for data selection. This paper proposes a framework called <projektor>, which predicts model performance and supports data selection decisions based on partial samples of prospective data sources. Our approach distinguishes itself from existing work by introducing a novel *two-stage* performance inference process. In the first stage, we leverage the Optimal Transport distance to predict the model's performance for any data mixture ratio within the range of disclosed data sizes. In the second stage, we extrapolate the performance to larger undisclosed data sizes based on a novel parameter-free mapping technique inspired by neural scaling laws. We further derive an efficient gradient-based method to select data sources based on the projected model performance. Evaluation over a diverse range of applications demonstrates that <projektor> significantly improves on existing performance scaling approaches in terms of both the accuracy of performance inference and the computational cost of constructing the performance predictor. In addition, <projektor> outperforms a range of off-the-shelf solutions by a wide margin in data selection effectiveness.
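
To make the first-stage idea more concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes two hypothetical sources with 1-D features (`source_a`, `source_b`) and a small validation set, computes the exact 1-D Wasserstein-1 distance between a weighted source mixture and the validation set, and scans mixture ratios on a grid. The grid scan is a simple stand-in for the gradient-based selection mentioned above, and minimizing the distance directly is only a proxy: the abstract describes mapping the distance to predicted performance, which is not reproduced here. All function and variable names are illustrative assumptions.

```python
# Toy sketch under stated assumptions: 1-D features, two sources, grid search.

def wasserstein_1d(xs, ws, ys, vs):
    """Exact Wasserstein-1 distance between two weighted 1-D samples.

    In one dimension, W1 equals the area between the two CDFs.
    """
    # Tag every point with the mass it contributes to each distribution.
    events = sorted(
        [(x, w, 0.0) for x, w in zip(xs, ws)]
        + [(y, 0.0, v) for y, v in zip(ys, vs)]
    )
    cdf_u = cdf_v = 0.0
    prev = events[0][0]
    area = 0.0
    for pos, mass_u, mass_v in events:
        # Accumulate |CDF gap| over the interval since the previous point.
        area += abs(cdf_u - cdf_v) * (pos - prev)
        cdf_u += mass_u
        cdf_v += mass_v
        prev = pos
    return area


def mixture_distance(p, source_a, source_b, validation):
    """Distance between the mixture p*A + (1-p)*B and the validation set."""
    xs = source_a + source_b
    ws = [p / len(source_a)] * len(source_a) \
        + [(1 - p) / len(source_b)] * len(source_b)
    vs = [1 / len(validation)] * len(validation)
    return wasserstein_1d(xs, ws, validation, vs)


# Hypothetical revealed samples from two sources.
source_a = [0.0, 0.1, 0.2, 0.3]
source_b = [1.0, 1.1, 1.2, 1.3]
# Validation set built as a 3:1 blend of the two sources.
validation = source_a * 3 + source_b

# Scan mixture ratios and keep the one closest to the validation set.
candidates = [i / 20 for i in range(21)]
best_p = min(
    candidates,
    key=lambda p: mixture_distance(p, source_a, source_b, validation),
)
print("closest mixture ratio for source A:", best_p)
```

Because the validation set here is constructed as a 3:1 blend, the scan recovers a ratio of 0.75 for source A. Real features are high-dimensional, so a practical version would need a general Optimal Transport solver in place of the 1-D closed form.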
