How can we train models to perform well on hard test data when hard training data is by definition difficult to label correctly? This question has been termed the scalable oversight problem and has drawn increasing attention as language models have continually improved. In this paper, we present the surprising conclusion that current language models often generalize relatively well from easy to hard data, even performing as well as "oracle" models trained on hard data. We demonstrate this kind of easy-to-hard generalization using simple training methods (in-context learning, linear classifier heads, and QLoRA) across seven different measures of datapoint hardness, including six empirically diverse human hardness measures (like grade level) and one model-based measure (loss-based). Furthermore, we show that even if one cares most about model performance on hard data, it can be better to collect and train on easy data rather than hard data, since hard data is generally noisier and costlier to collect. Our experiments use open models with up to 70B parameters and four publicly available question-answering datasets with questions ranging in difficulty from third-grade science questions to college-level STEM questions and general-knowledge trivia. We conclude that easy-to-hard generalization in LMs is surprisingly strong for the tasks studied, suggesting the scalable oversight problem may be easier than previously thought. Our code is available at https://github.com/allenai/easy-to-hard-generalization
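To make the comparison in the abstract concrete, the following is a minimal sketch of how one might split a dataset by a hardness measure and compare an easy-trained model against a hard-trained "oracle" on hard test data. It is not taken from the authors' repository: the names `Example`, `split_by_hardness`, `accuracy`, and `easy_to_hard_gap`, the use of a single numeric hardness score, and the thresholds are all assumptions made for illustration, and the predictors are treated as opaque callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    """One question-answering datapoint (hypothetical schema)."""
    question: str
    answer: str
    hardness: float  # assumed numeric hardness score, e.g. a grade level


# A predictor maps a question string to a predicted answer string.
Predictor = Callable[[str], str]


def split_by_hardness(
    examples: Sequence[Example], easy_max: float, hard_min: float
) -> Tuple[List[Example], List[Example]]:
    """Partition examples into easy and hard subsets.

    Examples with hardness strictly between the two thresholds are dropped.
    The thresholds are illustrative assumptions, not values from the paper.
    """
    easy = [ex for ex in examples if ex.hardness <= easy_max]
    hard = [ex for ex in examples if ex.hardness >= hard_min]
    return easy, hard


def accuracy(predict: Predictor, examples: Sequence[Example]) -> float:
    """Fraction of examples whose predicted answer matches exactly."""
    if not examples:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(predict(ex.question) == ex.answer for ex in examples)
    return correct / len(examples)


def easy_to_hard_gap(
    easy_trained: Predictor, hard_trained: Predictor, hard_test: Sequence[Example]
) -> float:
    """Accuracy of the hard-trained oracle minus the easy-trained model,
    both measured on hard test data. A gap near zero corresponds to
    strong easy-to-hard generalization."""
    return accuracy(hard_trained, hard_test) - accuracy(easy_trained, hard_test)
```

In use, one would fit one model on the easy subset and another on the hard subset (for instance with a linear classifier head or QLoRA), wrap each as a `Predictor`, and call `easy_to_hard_gap` on held-out hard examples.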