LLMs increasingly excel on AI benchmarks, but benchmark performance does not guarantee validity on downstream tasks. This study evaluates leading foundation models (FMs, i.e., generative pre-trained base LLMs) on out-of-distribution (OOD) tasks drawn from the teaching and learning of schoolchildren. Across all FMs, models' behaviors on disparate tasks correlate more strongly with one another than with expert human behavior on the target tasks. These biases, shared across LLMs, align poorly with downstream measures of teaching quality and are often \textit{negatively aligned with learning outcomes}. Moreover, we find that multi-model ensembles, whether by unanimous model voting or by expert-weighting on benchmark performance, exacerbate this misalignment with learning. We measure that 50\% of the variation in misalignment error is shared across foundation models, suggesting that common pretraining accounts for much of the misalignment on these tasks. We demonstrate methods for robustly measuring alignment on complex tasks and provide insights into both educational applications of foundation models and the limitations of the models themselves.
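One plausible way to formalize the shared-variation claim (a sketch only; the error notation $e_{m,i}$, the additive random-effects model, and the pairwise-correlation estimator are assumptions for illustration, not the paper's stated method): let $e_{m,i}$ denote model $m$'s misalignment error on task item $i$, for $m = 1, \dots, M$ models, and suppose errors decompose into a shared and an idiosyncratic component,
\[
  e_{m,i} \;=\; s_i + u_{m,i},
  \qquad
  \rho_{\text{shared}} \;=\; \frac{\sigma_s^2}{\sigma_s^2 + \sigma_u^2},
\]
where $\sigma_s^2 = \operatorname{Var}(s_i)$ and $\sigma_u^2 = \operatorname{Var}(u_{m,i})$. Under this model, the shared fraction can be estimated by the average pairwise correlation of errors across models,
\[
  \hat{\rho}_{\text{shared}}
  \;=\;
  \frac{2}{M(M-1)} \sum_{m < m'} \operatorname{Corr}_i\!\left(e_{m,i},\, e_{m',i}\right),
\]
so that $\hat{\rho}_{\text{shared}} \approx 0.5$ would correspond to the reported result that half of the variation in misalignment error is common across foundation models.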