Confusion-Aware Spectral Regularizer for Long-Tailed Recognition

Ziquan Zhu, Gaojie Jin, Hanruo Zhu, Si-Yuan Lu, Yunxiao Zhang, Zeyu Fu, Ronghui Mu, Guoqiang Zhang, Zhao Sun, Xia Yuhang, Jiaxing Shang, Xiang Li, Lu Liu, Tianjin Huang

Long-tailed image classification remains a persistent challenge, because real-world data typically follow highly imbalanced distributions in which a few head classes dominate and many tail classes contain only a limited number of samples. This imbalance biases feature learning toward head categories and significantly degrades performance on rare classes. Although recent studies have proposed re-sampling, re-weighting, and decoupled learning strategies, gains on the most underrepresented classes remain marginal compared with gains in overall accuracy. In this work, we present a confusion-centric perspective on long-tailed recognition that explicitly targets worst-class generalization. We first establish a new theoretical framework for class-specific error analysis, which shows that the worst-class error can be tightly upper-bounded in terms of the spectral norm of the frequency-weighted confusion matrix and a model-dependent complexity term. Guided by this insight, we propose the Confusion-Aware Spectral Regularizer (CAR), which minimizes the spectral norm of the confusion matrix during training to reduce inter-class confusion and improve tail-class generalization. To enable stable and efficient optimization, CAR integrates a Differentiable Confusion Matrix Surrogate and an EMA-based Confusion Estimator, which together maintain smooth, low-variance estimates across mini-batches. Extensive experiments on multiple long-tailed benchmarks demonstrate that CAR substantially improves both worst-class accuracy and overall performance. When combined with ConCutMix augmentation, CAR consistently surpasses existing state-of-the-art long-tailed learning methods on ImageNet-LT, CIFAR100-LT, and iNaturalist, both when training from scratch (by 2.37% to 4.83%) and when fine-tuning from pretrained models (by 2.42% to 4.17%).
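
To make the three ingredients named above concrete (a soft, differentiable confusion matrix, an EMA estimate across mini-batches, and its spectral norm as a penalty), here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the exact frequency weighting, so `class_weights` is a hypothetical per-row scaling hook; zeroing the diagonal so that only inter-class confusion is penalized, the `momentum` value, and skipping the EMA update for classes absent from a batch are likewise assumptions. All function names (`soft_confusion`, `ema_update`, `spectral_norm`) are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[float]]


def soft_confusion(probs: Sequence[Sequence[float]], labels: Sequence[int],
                   num_classes: int,
                   class_weights: Optional[Sequence[float]] = None
                   ) -> Tuple[Matrix, List[bool]]:
    """Soft confusion matrix: row y is the mean predicted probability that
    samples of true class y assign to each class. Using softmax outputs
    instead of argmax keeps every entry smooth in the logits. The diagonal
    is zeroed (assumption) and rows are scaled by hypothetical weights."""
    sums = [[0.0] * num_classes for _ in range(num_classes)]
    counts = [0] * num_classes
    for p, y in zip(probs, labels):
        counts[y] += 1
        for j, pj in enumerate(p):
            sums[y][j] += pj
    conf = [[0.0] * num_classes for _ in range(num_classes)]
    for i in range(num_classes):
        if counts[i] == 0:
            continue  # class absent from this mini-batch
        w = 1.0 if class_weights is None else class_weights[i]
        for j in range(num_classes):
            if i != j:
                conf[i][j] = w * sums[i][j] / counts[i]
    present = [c > 0 for c in counts]
    return conf, present


def ema_update(running: Matrix, batch: Matrix, present: Sequence[bool],
               momentum: float = 0.9) -> Matrix:
    """Exponential moving average of confusion rows; rows for classes not
    seen in the current batch are kept unchanged (assumption)."""
    return [
        [momentum * r + (1.0 - momentum) * b for r, b in zip(r_row, b_row)]
        if seen else list(r_row)
        for r_row, b_row, seen in zip(running, batch, present)
    ]


def spectral_norm(m: Matrix, iters: int = 100) -> float:
    """Largest singular value via power iteration on M^T M."""
    n_rows, n_cols = len(m), len(m[0])
    v = [1.0] * n_cols  # positive start vector
    for _ in range(iters):
        u = [sum(a * b for a, b in zip(row, v)) for row in m]           # M v
        v_new = [sum(m[i][j] * u[i] for i in range(n_rows))
                 for j in range(n_cols)]                                 # M^T u
        norm = math.sqrt(sum(x * x for x in v_new))
        if norm == 0.0:
            return 0.0  # zero matrix: no confusion at all
        v = [x / norm for x in v_new]
    u = [sum(a * b for a, b in zip(row, v)) for row in m]
    return math.sqrt(sum(x * x for x in u))  # ||M v|| with unit v


probs = [[0.9, 0.1], [0.7, 0.3], [0.4, 0.6]]
labels = [0, 0, 1]
conf, present = soft_confusion(probs, labels, num_classes=2)
print(conf, spectral_norm(conf))
```

For the toy batch, class 0 leaks an average of 0.2 probability mass to class 1 and class 1 leaks 0.4 to class 0, so the spectral norm (the penalty) is 0.4, driven by the most-confused class. In an actual training loop the same quantities would be computed with an autograd framework (for example, `torch.linalg.matrix_norm(C, ord=2)` in PyTorch) and added to the classification loss with some weight; one plausible choice, again an assumption, is to blend a detached EMA estimate with the current batch estimate so that gradients flow only through the batch term.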

翻译：长尾图像分类仍然是一个长期存在的挑战，因为现实世界的数据通常遵循高度不平衡的分布，其中少数头部类别占据主导地位，而许多尾部类别仅包含有限的样本。这种不平衡使得特征学习偏向于头部类别，并导致稀有类别的性能显著下降。尽管最近的研究提出了重采样、重加权和解耦学习策略，但与整体准确率相比，对最代表性不足类别的改进仍然有限。在这项工作中，我们提出了一个以混淆为中心的长尾识别视角，明确关注最差类别的泛化能力。我们首先建立了一个新的类别特定误差分析理论框架，该框架表明最差类别误差可以被频率加权混淆矩阵的谱范数和一个模型相关的复杂度项紧密上界。基于这一见解，我们提出了混淆感知谱正则化器（CAR），它在训练过程中最小化混淆矩阵的谱范数，以减少类间混淆并增强尾部类别的泛化能力。为了实现稳定高效的优化，CAR集成了一个可微混淆矩阵替代项和一个基于指数移动平均（EMA）的混淆估计器，以在整个小批量训练中保持平滑且低方差的估计。在多个长尾基准数据集上的大量实验表明，CAR显著提高了最差类别准确率和整体性能。当与ConCutMix数据增强结合使用时，在ImageNet-LT、CIFAR100-LT和iNaturalist数据集上，无论是从零开始训练（提升2.37% ~ 4.83%）还是从预训练模型微调（提升2.42% ~ 4.17%）的设置下，CAR均持续超越了现有的最先进长尾学习方法。