Imbalanced data distributions are prevalent in real-world scenarios and pose significant challenges for both imbalanced classification and imbalanced regression. They often cause deep learning models to overfit in regions of high sample density (many-shot regions) while underperforming in regions of low sample density (few-shot regions). This behavior limits the utility of deep learning models in many sectors, notably healthcare, where few-shot regions hold greater clinical relevance. While recent studies have shown the benefits of incorporating distribution information in imbalanced classification, such strategies have rarely been explored for imbalanced regression. In this paper, we address this gap by introducing a novel loss function, Dist Loss, which minimizes the distribution distance between the model's predictions and the target labels in a differentiable manner, thereby integrating distribution information into model training. Dist Loss lets deep learning models regularize their output distribution during training, which strengthens their focus on few-shot regions. We conduct extensive experiments on three datasets spanning computer vision and healthcare: IMDB-WIKI-DIR, AgeDB-DIR, and ECG-Ka-DIR. The results show that Dist Loss mitigates the negative impact of imbalanced data distributions on model performance and achieves state-of-the-art results in sparse data regions. Furthermore, Dist Loss is easy to integrate and complements existing methods.
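To make the idea of a distribution distance between predictions and labels concrete, the following is a minimal sketch and not the formulation used in the paper, which the abstract does not specify. It assumes one common choice: sort both samples and average the squared gaps, which for two equal-sized one-dimensional samples equals the squared 2-Wasserstein distance between their empirical distributions. The function names, the `weight` parameter, and the pairing with a per-sample mean squared error are all hypothetical.

```python
def sorted_distribution_distance(predictions, targets):
    """Mean squared gap between sorted predictions and sorted targets.

    Assumption for illustration only: the distribution distance is
    measured by comparing the two samples rank by rank.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("inputs must be non-empty")
    sorted_preds = sorted(predictions)
    sorted_targets = sorted(targets)
    n = len(sorted_preds)
    return sum((p - t) ** 2 for p, t in zip(sorted_preds, sorted_targets)) / n


def combined_loss(predictions, targets, weight=1.0):
    """Hypothetical total loss: per-sample MSE plus a weighted distribution term.

    The distribution term alone ignores which sample receives which value,
    so this sketch pairs it with an ordinary per-sample error.
    """
    n = len(predictions)
    mse = sum((p - t) ** 2 for p, t in zip(predictions, targets)) / n
    return mse + weight * sorted_distribution_distance(predictions, targets)


# Imbalanced labels: four samples in a dense region, one in a sparse region.
targets = [20.0, 21.0, 22.0, 23.0, 80.0]
# Predictions that collapse toward the dense region.
collapsed = [21.0, 21.0, 22.0, 22.0, 22.0]
# Predictions whose spread matches the labels.
spread = [20.0, 21.0, 22.0, 23.0, 78.0]

collapsed_distance = sorted_distribution_distance(collapsed, targets)
spread_distance = sorted_distribution_distance(spread, targets)
```

In this toy case the collapsed predictions incur a much larger distribution term than the spread predictions, because the largest prediction sits far from the largest label. Since sorting only permutes its inputs, the same computation written in an automatic differentiation framework would pass gradients back to the individual predictions.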