In long-tail scenarios, models depend heavily on high-quality data. Data-centric approaches aim to improve model performance by increasing both the quantity and the quality of data. Among these approaches, information augmentation has gradually emerged as an important category: it balances model performance by increasing the richness and number of tail-class samples. However, the mechanisms underlying the effectiveness of information augmentation remain largely unstudied, so its use in long-tail recognition still relies heavily on empirical, intricate fine-tuning. This work makes two primary contributions. First, we approach the problem from the perspectives of feature diversity and distribution shift and introduce Feature Diversity Gain (FDG) to explain why information augmentation is effective. We find that the performance of information augmentation can be explained by FDG and that it peaks when FDG achieves an appropriate balance. Experimental results show that using FDG to select augmented data further improves model performance without any modification to the model's architecture. Data-centric approaches therefore hold significant potential for long-tail recognition, beyond the development of new model structures. Second, we present, for the first time, a systematic account of the core components and fundamental tasks of a data-centric long-tail learning framework. The core components guide the implementation and deployment of the system, while the corresponding fundamental tasks refine and expand the research area.