The rise in internet usage has generated massive amounts of data, driving the adoption of supervised and semi-supervised machine learning algorithms that can use this data effectively to train models. However, before these models are deployed in the real world, they must be rigorously evaluated on performance measures such as worst-case recall and must satisfy constraints such as fairness. We find that current state-of-the-art empirical techniques offer sub-optimal performance on these practical, non-decomposable performance objectives. Theoretical techniques, on the other hand, require training a new model from scratch for each performance objective. To bridge this gap, we propose SelMix, an inexpensive fine-tuning technique based on selective mixup that optimizes a pre-trained model for the desired objective. The core idea of our framework is to determine a sampling distribution for performing mixup of features between samples from particular classes, such that the mixup optimizes the given objective. We comprehensively evaluate our technique against existing empirical and theoretically principled methods on standard benchmark datasets for imbalanced classification. We find that the proposed SelMix fine-tuning significantly improves performance on various practical non-decomposable objectives across benchmarks.
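To make the two central notions concrete, the following is a minimal sketch and not the SelMix implementation. It assumes that worst-case recall means the minimum per-class recall computed from a confusion matrix, and that the sampling distribution is given as a matrix over ordered class pairs. The names `worst_case_recall`, `selective_mixup`, `pair_probs`, and the mixing coefficient `lam` are hypothetical, and how SelMix actually derives the distribution is not specified here.

```python
import numpy as np


def worst_case_recall(confusion):
    """Minimum per-class recall; rows of `confusion` are true classes.

    The metric depends on the whole confusion matrix, so it cannot be
    written as an average of per-sample losses (it is non-decomposable).
    """
    confusion = np.asarray(confusion, dtype=float)
    recalls = np.diag(confusion) / confusion.sum(axis=1)
    return float(recalls.min())


def selective_mixup(features, labels, pair_probs, rng, lam=0.6):
    """Sample a class pair (i, j) from `pair_probs` and mix one feature of each.

    Assumption: `pair_probs` is a (K, K) matrix that sums to 1. Here it is
    supplied by the caller as a placeholder for the learned distribution.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    num_classes = pair_probs.shape[0]
    # Draw a flat index over all K*K ordered pairs, then unpack it.
    flat_index = rng.choice(num_classes * num_classes, p=pair_probs.ravel())
    i, j = divmod(int(flat_index), num_classes)
    # Pick one random sample from each of the two selected classes.
    a = features[rng.choice(np.flatnonzero(labels == i))]
    b = features[rng.choice(np.flatnonzero(labels == j))]
    # Standard mixup: convex combination of the two feature vectors.
    return lam * a + (1.0 - lam) * b, (i, j)
```

In this sketch, a distribution that concentrates on pairs involving a poorly recalled class would produce more mixed features for that class during fine-tuning. Which pairs deserve more mass for a given objective is the quantity the framework is described as determining.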