We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn from a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline. Task Haystack draws inspiration from the widely-adopted "needle-in-a-haystack" (NIAH) evaluation, but presents new and unique challenges. It demands that models (1) utilize the contexts with deeper understanding, rather than resorting to simple copying and pasting, and (2) navigate through long streams of evolving topics and tasks, closely approximating the complexities of real-world usage of long-context LMs. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively. We benchmark 12 long-context LMs using Task Haystack. We find that state-of-the-art closed models such as GPT-4o still struggle in this setting, failing 15% of the cases on average, while all open-weight models we evaluate lag behind by a large margin, failing up to 61% of the cases. In our controlled analysis, we identify factors such as distraction and recency bias as contributors to these failure cases. Further, we observe declines in performance when task instructions are paraphrased at test time or when ICL demonstrations are repeated excessively, raising concerns about the robustness, instruction understanding, and true context utilization of current long-context LMs.
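To make the setting concrete, the sketch below shows one way a Lifelong ICL prompt could be assembled and contrasted with the Single-task ICL baseline. This is a minimal illustration under stated assumptions: the Task dataclass, the build_lifelong_prompt helper, and the "Input:/Output:" prompt template are hypothetical and not the paper's exact implementation.

```python
# Minimal sketch of the Lifelong ICL setup (illustrative, not the paper's code).
from dataclasses import dataclass

@dataclass
class Task:
    instruction: str              # e.g., "Classify the sentiment of the review."
    demos: list[tuple[str, str]]  # (input, label) ICL demonstrations

def build_lifelong_prompt(tasks: list[Task], test_task: Task, test_input: str) -> str:
    """Concatenate demonstrations from a stream of tasks (the "haystack"),
    then append the test task's instruction and the test input. The model
    must locate the relevant demonstrations among those of other tasks."""
    blocks = []
    for task in tasks:  # tasks appear sequentially, as in Lifelong ICL
        lines = [task.instruction]
        lines += [f"Input: {x}\nOutput: {y}" for x, y in task.demos]
        blocks.append("\n".join(lines))
    # Test-time query: instruction and input only; the label must be inferred
    # from the in-context demonstrations seen earlier in the prompt.
    query = f"{test_task.instruction}\nInput: {test_input}\nOutput:"
    return "\n\n".join(blocks + [query])

# Lifelong ICL: the test task's demos are buried in a long stream of tasks.
# lifelong_prompt = build_lifelong_prompt(task_stream, test_task, test_input)
# Single-task ICL baseline: only the test task's own demonstrations in context.
# baseline_prompt = build_lifelong_prompt([test_task], test_task, test_input)
```

Under this framing, a model "passes" a case when its accuracy on the Lifelong ICL prompt is not significantly worse than on the Single-task ICL baseline prompt for the same task.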