LLMs increasingly excel on AI benchmarks, but benchmark success does not guarantee validity on downstream tasks. This study contrasts LLM alignment on benchmarks, on downstream tasks, and, importantly, on the intended impact of those tasks. We evaluate the performance of leading LLMs (i.e., generative pre-trained base models) on difficult-to-verify tasks in the teaching and learning of schoolchildren. Across all LLMs, inter-model behaviors on disparate tasks correlate more strongly with one another than with expert human behaviors on the target tasks. These biases, shared across LLMs, are poorly aligned with downstream measures of teaching quality and often negatively aligned with the intended impact on student learning outcomes. Further, we find that multi-model ensembles, whether by unanimous model voting or by expert-weighting on benchmark performance, further exacerbate misalignment with learning. We measure that the choice of LLM and/or prompting strategy reliably accounts for only $15\%$ of all measured misalignment error, and that the remaining variation in misalignment error is shared across LLMs, suggesting that common pretraining accounts for much of the misalignment on these tasks. We demonstrate methods for robustly measuring alignment on complex tasks and provide unique insights into practical applications of LLMs in high-noise contexts.