Addressing class imbalance in long-tailed semi-supervised learning (SSL) poses several significant challenges, which stem from differences between the marginal distributions of the unlabeled and labeled data: the former is often unknown and potentially distinct from the latter. The first challenge is to avoid biasing the pseudo-labels towards an incorrect distribution, such as that of the labeled data or a balanced distribution, during training. The second challenge is that we still wish to ensure a balanced unlabeled distribution during inference. To address both challenges, we propose a three-faceted solution: a flexible distribution alignment that progressively aligns the classifier from a dynamically estimated unlabeled prior towards a balanced distribution, a soft consistency regularization that exploits underconfident pseudo-labels discarded by threshold-based methods, and a scheme for expanding the unlabeled set with input data from the labeled partition. This last facet responds to the commonly overlooked fact that disjoint partitions of labeled and unlabeled data prevent the labeled set from benefiting from strong data augmentation. Our overall framework requires no additional training cycles, so it aligns, distills, and augments everything all at once (ADALLO). Extensive evaluations on imbalanced SSL benchmark datasets, including CIFAR10-LT, CIFAR100-LT, and STL10-LT, with varying degrees of class imbalance, amounts of labeled data, and distribution mismatch, show that ADALLO significantly improves the performance of imbalanced SSL under large distribution mismatch and is competitive with state-of-the-art methods when the labeled and unlabeled data follow the same marginal distribution. Our code will be released upon paper acceptance.
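To make the flexible distribution alignment more concrete, the following is a minimal sketch of one way such a schedule could work; it is not the authors' implementation. It assumes that the unlabeled prior is estimated as an exponential moving average of the model's mean predictions, that the target is a tempered version of that prior whose exponent decays from 1 to 0 over training, and that predictions are rescaled by the ratio of target to estimated prior. The function names (`ema_update`, `tempered_target`, `align`) and the parameters `momentum`, `progress`, and `eps` are hypothetical and chosen only for illustration.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def ema_update(running, batch_probs, momentum=0.9):
    """Update a running estimate of the unlabeled class prior.

    Assumption: the prior is tracked as an exponential moving average
    of the mean predicted distribution over each unlabeled batch.
    """
    n = len(batch_probs)
    batch_mean = [sum(p[c] for p in batch_probs) / n for c in range(len(running))]
    return [momentum * r + (1.0 - momentum) * m for r, m in zip(running, batch_mean)]


def tempered_target(prior, progress):
    """Interpolate between the estimated prior and a balanced distribution.

    Assumption: progress runs from 0.0 (start of training, target equals
    the estimated prior) to 1.0 (end of training, target is uniform).
    """
    exponent = 1.0 - progress
    return normalize([p ** exponent for p in prior])


def align(probs, prior, target, eps=1e-8):
    """Rescale one prediction by target / prior and renormalize."""
    return normalize([p * t / max(q, eps) for p, t, q in zip(probs, target, prior)])
```

Under these assumptions, predictions pass through `align` unchanged early in training, because the target equals the estimated prior, and are pushed increasingly towards a balanced classifier as `progress` approaches 1, which also shifts probability mass towards under-represented classes.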