Classification models are expected to perform equally well across classes, yet in practice their per-class performance often differs widely. This class bias is widely studied for datasets with sample imbalance but is relatively overlooked for balanced datasets. In this work, we introduce the concept of spectral imbalance in features as a potential source of class disparities and study the connection between spectral imbalance and class bias in both theory and practice. To establish this connection, we develop a theoretical framework for studying class disparities and derive exact expressions for the per-class error in a high-dimensional mixture model setting. We then study this phenomenon in 11 different state-of-the-art pretrained encoders and show how the proposed framework can be used to compare the quality of encoders, as well as to evaluate and combine data augmentation strategies that mitigate the issue. Our work sheds light on the class-dependent effects of learning and provides new insights into how state-of-the-art pretrained features may carry unknown biases that can be diagnosed through their spectra.
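As a rough illustration of what diagnosing features "through their spectra" could look like, the sketch below computes the eigenvalues of each class's feature covariance on a balanced synthetic dataset. It rests on an assumption: the abstract does not define spectral imbalance, so it is taken here to mean differences between class-conditional covariance spectra. The helper names `class_spectra` and `make_demo` and the synthetic Gaussian data are hypothetical and are not taken from the paper.

```python
import numpy as np


def class_spectra(features, labels):
    """Eigenvalues (descending) of each class's feature covariance.

    Assumption: "spectral imbalance" is read here as a difference between
    these per-class spectra; the abstract itself gives no formula.
    """
    spectra = {}
    for c in np.unique(labels):
        class_features = features[labels == c]
        # rowvar=False: rows are samples, columns are feature dimensions.
        cov = np.cov(class_features, rowvar=False)
        # eigvalsh returns ascending eigenvalues of a symmetric matrix.
        eigvals = np.linalg.eigvalsh(cov)[::-1]
        # Clip tiny negative values caused by floating-point error.
        spectra[int(c)] = np.clip(eigvals, 0.0, None)
    return spectra


def make_demo(n_per_class=500, dim=8, seed=0):
    """Hypothetical balanced two-class data with different class spectra."""
    rng = np.random.default_rng(seed)
    std_class0 = np.ones(dim)                  # flat spectrum
    std_class1 = np.linspace(2.0, 0.2, dim)    # decaying spectrum
    x0 = rng.normal(size=(n_per_class, dim)) * std_class0
    x1 = rng.normal(size=(n_per_class, dim)) * std_class1 + 1.0
    features = np.vstack([x0, x1])
    labels = np.repeat([0, 1], n_per_class)
    return features, labels


features, labels = make_demo()
spectra = class_spectra(features, labels)
for c, eigvals in spectra.items():
    print(f"class {c}: top eigenvalues {np.round(eigvals[:3], 2)}")
```

With features extracted from a pretrained encoder in place of the synthetic data, the same per-class spectra could be plotted side by side to look for classes whose eigenvalue profile departs from the others. Both classes here have the same number of samples, so any difference in the spectra is independent of sample imbalance.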