An accurate and substantial dataset is necessary to train a reliable, well-performing model. However, even manually labeled datasets contain errors, not to mention automatically labeled ones. Data denoising has been addressed in prior research, most of which focuses on detecting outliers and permanently removing them, a process that is likely to over- or under-filter the dataset. In this work, we propose AGRA: a new method for Adaptive GRAdient-based outlier removal. Instead of cleaning the dataset before model training, AGRA adjusts the dataset during training. By comparing the aggregated gradient of a batch of samples with the gradient of an individual example, our method dynamically decides whether that example is helpful for the model at this point or is counterproductive and should be left out of the current update. Extensive evaluation on several datasets demonstrates the effectiveness of AGRA, while a comprehensive analysis of the results supports our initial hypothesis: permanent hard outlier removal is not always what the model benefits from most.
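The gradient comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the decision threshold, or the model, so the cosine similarity, the threshold of 0, the logistic-regression model, and all function names below (`example_gradient`, `filter_batch`, `train_step`) are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def example_gradient(w, x, y):
    """Gradient of the logistic loss for one example (x, y) w.r.t. weights w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]

def mean_gradient(w, batch):
    """Aggregated (mean) gradient over a batch of (x, y) pairs."""
    grads = [example_gradient(w, x, y) for x, y in batch]
    return [sum(col) / len(grads) for col in zip(*grads)]

def cosine(a, b):
    """Cosine similarity; defined as 0.0 if either vector is all zeros."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    na = math.sqrt(sum(ai * ai for ai in a))
    nb = math.sqrt(sum(bi * bi for bi in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def filter_batch(w, batch, comparison_batch, threshold=0.0):
    """Keep examples whose gradient agrees with the aggregated gradient.

    Assumption: "agrees" means cosine similarity above `threshold`.
    """
    agg = mean_gradient(w, comparison_batch)
    return [(x, y) for x, y in batch
            if cosine(example_gradient(w, x, y), agg) > threshold]

def train_step(w, batch, comparison_batch, lr=0.1):
    """One update that skips disagreeing examples for this step only."""
    kept = filter_batch(w, batch, comparison_batch)
    if not kept:
        return w, kept  # nothing to learn from in this step
    g = mean_gradient(w, kept)
    return [wi - lr * gi for wi, gi in zip(w, g)], kept

# Toy data: the label follows the sign of the first feature.
comparison_batch = [([1.0, 0.0], 1), ([2.0, 0.0], 1), ([-1.0, 0.0], 0)]
# The second example is mislabeled (positive feature, label 0).
batch = [([1.5, 0.0], 1), ([1.0, 0.0], 0)]

w = [0.0, 0.0]
w_new, kept = train_step(w, batch, comparison_batch)
print(kept)
```

In this toy run the mislabeled example is excluded from the update because its gradient points opposite to the aggregated gradient. Nothing is deleted from the dataset: the same example is evaluated again at the next step against the updated weights, which is the contrast with permanent removal that the abstract draws.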