Imbalanced datasets pose a significant challenge for machine learning models and often lead to biased predictions. To address this issue, data augmentation techniques are widely used in natural language processing (NLP) to generate new samples for the minority class. In this paper, we challenge the common assumption that data augmentation is always necessary to improve predictions on imbalanced datasets. We argue instead that adjusting the classifier cutoffs, without any data augmentation, can produce results similar to those of oversampling techniques, and we provide theoretical and empirical evidence to support this claim. Our findings contribute to a better understanding of the strengths and limitations of different approaches to imbalanced data and help researchers and practitioners make informed decisions about which methods to use for a given task.
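The claim that a cutoff adjustment can stand in for oversampling can be illustrated with a small idealized sketch. It is not the method of the paper: it assumes a well-calibrated probabilistic classifier and oversampling by exact replication of the minority (positive) class `k` times, and the function names `oversampled_posterior`, `equivalent_cutoff`, and `decisions_match` are hypothetical. Under those assumptions, replicating positives `k` times multiplies the posterior odds by `k`, so applying a cutoff of 0.5 after oversampling gives the same decisions as applying a cutoff of `1 / (1 + k)` to the original probabilities.

```python
import numpy as np


def oversampled_posterior(p, k):
    """Posterior an ideal (calibrated) model would output after the
    positive class is replicated k times. Assumption: oversampling only
    scales the class prior, so the posterior odds are multiplied by k."""
    p = np.asarray(p, dtype=float)
    return k * p / (k * p + (1.0 - p))


def equivalent_cutoff(k, cutoff=0.5):
    """Cutoff on the ORIGINAL posterior that matches applying `cutoff`
    to the posterior obtained after k-fold oversampling.

    Derivation: k*p / (k*p + 1 - p) >= c  <=>  p >= c / (c + k*(1 - c)).
    """
    return cutoff / (cutoff + k * (1.0 - cutoff))


def decisions_match(p, k):
    """Check that both routes label every sample identically."""
    p = np.asarray(p, dtype=float)
    after_oversampling = oversampled_posterior(p, k) >= 0.5
    after_cutoff_shift = p >= equivalent_cutoff(k)
    return bool(np.array_equal(after_oversampling, after_cutoff_shift))


if __name__ == "__main__":
    # Hypothetical scores from a model trained on the imbalanced data.
    scores = np.array([0.02, 0.05, 0.2, 0.6, 0.95])
    k = 9  # e.g. a 9:1 majority-to-minority ratio, balanced by replication
    print("equivalent cutoff:", equivalent_cutoff(k))
    print("same decisions:", decisions_match(scores, k))
```

With a 9:1 class ratio, balancing by replication corresponds to `k = 9`, and the matching cutoff on the original probabilities is the minority-class prevalence. Real classifiers are rarely perfectly calibrated, and augmentation methods that synthesize new samples are not pure replication, so the equivalence should be read as approximate in practice.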