Sample re-weighting strategies provide a promising mechanism for dealing with imperfect training data in machine learning, such as noisily labeled or class-imbalanced data. One such strategy formulates a bi-level optimization problem, called the meta re-weighting problem, whose goal is to optimize performance on a small set of perfect pivotal samples, called meta samples. Many approaches have been proposed to solve this problem efficiently. However, all of them assume that a perfect meta sample set is already provided, whereas we observe that the selection of the meta sample set is performance-critical. In this paper, we study how to learn to identify such a meta sample set from a large, imperfect training set; the selected set is subsequently cleaned and used to optimize performance in the meta re-weighting setting. We propose a learning framework that reduces the meta sample selection problem to a weighted K-means clustering problem through rigorous theoretical analysis. Within this framework, we propose two clustering methods, a Representation-based clustering method (RBC) and a Gradient-based clustering method (GBC), which balance performance and computational efficiency. Empirical studies demonstrate the performance advantage of our methods over various baseline methods.
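
To make the reduction to weighted K-means concrete, here is a minimal sketch of a generic Lloyd-style weighted K-means. It is not the RBC or GBC procedure itself: the abstract does not say how the per-sample vectors or weights are computed, so the feature matrix `X`, the weight vector `w`, and the function names `weighted_kmeans` and `select_meta_samples` are hypothetical. Picking the sample nearest to each centroid as a meta sample candidate is likewise an assumption made only for illustration.

```python
import numpy as np


def weighted_kmeans(X, w, k, n_iter=50, seed=0):
    """Generic Lloyd-style weighted K-means (illustrative sketch).

    X: (n, d) array of per-sample vectors (assumed given).
    w: (n,) array of non-negative per-sample weights (assumed given).
    Returns (centroids, labels).
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct samples.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid.
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            mask = labels == j
            # Weighted mean of the cluster; keep the old centroid if the
            # cluster is empty or carries zero total weight.
            if w[mask].sum() > 0:
                new_centroids[j] = np.average(X[mask], axis=0, weights=w[mask])
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    # Final assignment against the returned centroids.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1)


def select_meta_samples(X, centroids):
    """Hypothetical selection rule: index of the sample nearest each centroid."""
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=0)


if __name__ == "__main__":
    X = np.array([[0, 0], [0, 1], [1, 0],
                  [10, 10], [10, 11], [11, 10]], dtype=float)
    w = np.ones(len(X))
    centroids, labels = weighted_kmeans(X, w, k=2)
    print(labels, select_meta_samples(X, centroids))
```

In this sketch the weights only affect the centroid update, which is the standard weighted K-means objective; how the weights should be chosen for meta sample selection is left open here.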