In-context learning (ICL) has become the default method for using large language models (LLMs), so it is crucial to explore its limitations and understand their underlying causes. In this paper, we find that ICL falls short on specification-heavy tasks: tasks with complicated and extensive specifications that ordinary humans need several hours to master, such as traditional information extraction tasks. On most of these tasks, ICL performance does not reach half of the state-of-the-art results. To explore the reasons behind this failure, we conduct comprehensive experiments on 18 specification-heavy tasks with various LLMs and identify three primary reasons: inability to specifically understand context, task schema comprehension that is misaligned with humans, and inadequate long-text understanding. Furthermore, we demonstrate that LLMs can achieve decent performance on these tasks through fine-tuning, indicating that the failure of ICL is not an inherent flaw of LLMs but a drawback of existing alignment methods, which leave LLMs incapable of handling complicated specification-heavy tasks via ICL. To substantiate this, we perform dedicated instruction tuning on LLMs for these tasks and observe a notable improvement. We hope the analyses in this paper can facilitate advances in alignment methods that enable LLMs to meet more sophisticated human demands.