Steering the behavior of a strong model pre-trained on internet-scale data can be difficult due to the scarcity of competent supervisors. Recent studies reveal that, despite supervisory noise, a strong student model may surpass its weak teacher when fine-tuned on specific objectives. Yet, the effectiveness of such weak-to-strong generalization remains limited, especially in the presence of large capability gaps. In this paper, we propose to address this challenge by harnessing a diverse set of specialized teachers, instead of a single generalist one, that collectively supervise the strong student. Our approach resembles the classical hierarchical mixture of experts, with two components tailored for co-supervision: (i) we progressively alternate student training and teacher assignment, leveraging the growth of the strong student to identify plausible supervision; (ii) we conservatively enforce teacher-student and local-global consistency, leveraging their dependencies to reject potential annotation noise. We validate the proposed method through visual recognition tasks on the OpenAI weak-to-strong benchmark and additional multi-domain datasets. Our code is available at \url{https://github.com/yuejiangliu/csl}.
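As a rough illustration of component (i) and the teacher-student half of component (ii) described in the abstract, the toy sketch below alternates between fitting a student and re-assigning one specialized teacher per domain, then keeps only the samples on which the assigned teacher and the student agree. It is not the authors' implementation (the linked repository is the reference for that). The 1-D threshold student, the function names `fit_threshold`, `assign_teachers`, and `co_supervise`, the vote-based initialization, and the toy data are all assumptions made for illustration, and local-global consistency is left out.

```python
from statistics import mean


def fit_threshold(xs, ys):
    """Toy 'student': a 1-D threshold placed midway between the class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    if not pos or not neg:
        return 0.0  # assumed fallback when one class is missing
    return (mean(pos) + mean(neg)) / 2


def predict(threshold, xs):
    """Label a point 1 if it lies above the threshold, else 0."""
    return [1 if x > threshold else 0 for x in xs]


def assign_teachers(student_preds, teacher_labels, groups):
    """Per domain, pick the teacher whose labels agree most with the student."""
    assignment = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        agreement = [
            sum(labels[i] == student_preds[i] for i in idx)
            for labels in teacher_labels
        ]
        # Ties go to the lowest teacher index.
        assignment[g] = max(range(len(teacher_labels)), key=agreement.__getitem__)
    return assignment


def co_supervise(xs, teacher_labels, groups, rounds=3):
    """Alternate student training and teacher assignment (hypothetical sketch)."""
    n = len(xs)
    # Assumed initialization: vote over all teachers, ties broken toward class 1.
    labels = [
        1 if 2 * sum(t[i] for t in teacher_labels) >= len(teacher_labels) else 0
        for i in range(n)
    ]
    kept = list(range(n))
    threshold, assignment = 0.0, {}
    for _ in range(rounds):
        # Train the student on the currently trusted samples.
        threshold = fit_threshold([xs[i] for i in kept], [labels[i] for i in kept])
        preds = predict(threshold, xs)
        # Re-assign one teacher per domain using the improved student.
        assignment = assign_teachers(preds, teacher_labels, groups)
        labels = [teacher_labels[assignment[g]][i] for i, g in enumerate(groups)]
        # Teacher-student consistency: drop samples where the two disagree.
        kept = [i for i in range(n) if labels[i] == preds[i]]
    return threshold, assignment, kept


# Toy data: teacher0 is reliable on domain "a", teacher1 on domain "b".
xs = [-2.0, -1.0, 1.0, 2.0, -3.0, -1.5, 1.5, 3.0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
teacher0 = [0, 0, 1, 1, 1, 0, 0, 1]
teacher1 = [0, 1, 0, 1, 0, 0, 1, 1]

threshold, assignment, kept = co_supervise(xs, [teacher0, teacher1], groups)
print(threshold, assignment, kept)
```

On this toy data the loop assigns teacher 0 to domain `a` and teacher 1 to domain `b`, and every sample survives the agreement filter. With noisier teachers the filter would shrink `kept`, which is the conservative behavior the abstract refers to.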