Although large language models (LLMs) exhibit a remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new concepts or facts from ground-truth labels. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept guidelines for sentence labeling tasks. We design guidelines that present different types of factual and counterfactual concept definitions, which are used as prompts for zero-shot sentence classification tasks. Our results show that although concept definitions consistently improve task performance, only the larger models (with 70B parameters or more) show even a limited ability to work under counterfactual contexts. Importantly, only proprietary models such as GPT-3.5 and GPT-4 can recognize nonsensical guidelines, which we hypothesize is due to more sophisticated alignment methods. Finally, we find that Falcon-180B-chat is outperformed by Llama-2-70B-chat in most cases, which indicates that careful fine-tuning is more effective than increasing model scale. Altogether, our simple evaluation method reveals significant gaps in concept understanding between the most capable open-source language models and the leading proprietary APIs.