Neural networks can learn spurious correlations in the data, often leading to performance disparities for underrepresented subgroups. Studies have demonstrated that this disparity is amplified when knowledge is distilled from a complex teacher model to a relatively "simple" student model. Prior work has shown that ensemble deep learning methods can improve the performance of the worst-case subgroups; however, it is unclear whether this advantage carries over when distilling knowledge from an ensemble of teachers, especially when the teacher models are debiased. This study demonstrates that traditional ensemble knowledge distillation can significantly degrade the performance of the worst-case subgroups in the distilled student model even when the teacher models are debiased. To overcome this, we propose Adaptive Group Robust Ensemble Knowledge Distillation (AGRE-KD), a simple ensembling strategy that ensures the student model receives knowledge beneficial to unknown underrepresented subgroups. Leveraging an additional biased model, our method selectively chooses teachers whose knowledge better improves the worst-performing subgroups by upweighting teachers whose gradient directions deviate from that of the biased model. Our experiments on several datasets demonstrate the superiority of the proposed ensemble distillation technique and show that it can even outperform classic model ensembles based on majority voting.
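To make the teacher-weighting idea concrete, the sketch below illustrates one possible reading of the approach in PyTorch: each teacher's distillation loss induces a gradient on the student, and teachers whose gradient direction deviates from that of an auxiliary biased model receive larger weights. This is a minimal, hypothetical sketch, not the paper's implementation; the cosine-based deviation score, the softmax normalization across teachers, and all function and parameter names are illustrative assumptions.

```python
# Illustrative sketch of gradient-deviation-based teacher weighting for
# ensemble knowledge distillation (assumed setup, not the exact AGRE-KD code).
import torch
import torch.nn.functional as F

def agre_kd_step(student, teachers, biased_model, x, optimizer, temperature=2.0):
    """One distillation step: upweight teachers whose distillation gradient
    on the student deviates from the gradient induced by the biased model."""
    optimizer.zero_grad()
    student_logits = student(x)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)

    def distill_grad(reference_model):
        # Gradient of the KL distillation loss w.r.t. the student parameters,
        # using the reference model's softened outputs as the target.
        with torch.no_grad():
            target = F.softmax(reference_model(x) / temperature, dim=-1)
        loss = F.kl_div(log_p_student, target, reduction="batchmean")
        grads = torch.autograd.grad(loss, list(student.parameters()),
                                    retain_graph=True)
        return torch.cat([g.flatten() for g in grads]), loss

    bias_grad, _ = distill_grad(biased_model)

    deviations, losses = [], []
    for teacher in teachers:
        t_grad, t_loss = distill_grad(teacher)
        # Larger weight for teachers whose gradient direction disagrees with
        # the biased model's direction (1 - cosine similarity, an assumed choice).
        deviations.append(1.0 - F.cosine_similarity(t_grad, bias_grad, dim=0))
        losses.append(t_loss)

    weights = torch.softmax(torch.stack(deviations), dim=0)  # normalize over teachers
    total_loss = sum(w * l for w, l in zip(weights, losses))
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```

Under these assumptions, a teacher aligned with the biased model's shortcut contributes little to the weighted distillation loss, while teachers pushing the student in a different direction dominate the update.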