Gauging the performance of ML models on data from unseen domains at test time is an essential yet challenging problem because labels are unavailable in this setting. Moreover, a model's performance on in-distribution data is a poor indicator of its performance on data from unseen domains. It is therefore essential to develop metrics that provide insight into the model's test-time performance and can be computed using only the information available at test time (such as the model parameters, the training data or its statistics, and the unlabeled test data). To this end, we propose a metric based on Optimal Transport that is highly correlated with the model's performance on unseen domains and can be computed efficiently using only information available at test time. Concretely, our metric characterizes the model's performance on unseen domains using only a small amount of unlabeled data from these domains and data or statistics from the training (source) domain(s). Through extensive empirical evaluation on standard benchmark datasets and their corruptions, we demonstrate the utility of our metric for estimating the model's performance in several practical applications. These include selecting the source data and architecture that lead to the best performance on data from an unseen domain, and predicting a deployed model's performance at test time on unseen domains. Our empirical results show that our metric, which uses information from both the source and the unseen domain, is highly correlated with the model's performance and achieves a significantly better correlation than the popular prediction entropy-based metric, which is computed solely from data in the unseen domain.
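To make the contrast between the two kinds of metric concrete, the following is a minimal sketch and not the method proposed here. The abstract does not specify the transport cost or the representation the metric operates on, so the sketch assumes (hypothetically) that the model exposes fixed-length feature vectors for source and target samples, and it swaps in a sliced Wasserstein approximation, which averages exact 1D Wasserstein distances over random projections. All function names are illustrative. The entropy baseline uses only target-domain predictions, while the transport-based score compares source and target.

```python
import bisect
import math
import random


def wasserstein_1d(xs, ys):
    """Exact 1-Wasserstein distance between two 1D empirical distributions.

    Computed as the integral of |F_x(t) - F_y(t)| over t, where F_x and F_y
    are the empirical CDFs of the two samples.
    """
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(xs + ys)
    total = 0.0
    for left, right in zip(grid, grid[1:]):
        cdf_x = bisect.bisect_right(xs, left) / len(xs)
        cdf_y = bisect.bisect_right(ys, left) / len(ys)
        total += abs(cdf_x - cdf_y) * (right - left)
    return total


def sliced_ot_distance(source, target, n_projections=50, seed=0):
    """Sliced Wasserstein approximation between two sets of feature vectors.

    Assumption: `source` and `target` are lists of equal-dimension feature
    vectors (for example, penultimate-layer activations). Needs no labels.
    """
    rng = random.Random(seed)
    dim = len(source[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # Project both samples onto the direction and compare in 1D.
        proj_s = [sum(v * d for v, d in zip(row, direction)) for row in source]
        proj_t = [sum(v * d for v, d in zip(row, direction)) for row in target]
        total += wasserstein_1d(proj_s, proj_t)
    return total / n_projections


def mean_prediction_entropy(probabilities):
    """Baseline: mean Shannon entropy of predicted class probabilities.

    Uses only target-domain predictions; no source information is involved.
    """
    entropies = [
        -sum(p * math.log(p) for p in row if p > 0.0) for row in probabilities
    ]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic stand-ins for features; real use would take them from a model.
    source = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(200)]
    shifted = [[rng.gauss(3.0, 1.0) for _ in range(5)] for _ in range(200)]
    print("source vs shifted target:", sliced_ot_distance(source, shifted))
    print("entropy of uniform prediction:", mean_prediction_entropy([[0.25] * 4]))
```

A larger transport score indicates a larger gap between source and target features. Turning either score into an accuracy estimate would need calibration against held-out domains, which this sketch does not attempt.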