We introduce Lifelong ICL, a problem setting that challenges long-context language models (LMs) to learn from a sequence of language tasks through in-context learning (ICL). We further introduce Task Haystack, an evaluation suite dedicated to assessing and diagnosing how long-context LMs utilize contexts in Lifelong ICL. When given a task instruction and test inputs, long-context LMs are expected to leverage the relevant demonstrations in the Lifelong ICL prompt, avoid distraction and interference from other tasks, and achieve test accuracies that are not significantly worse than those of the Single-task ICL baseline. Task Haystack draws inspiration from the widely-adopted "needle-in-a-haystack" (NIAH) evaluation, but presents new and unique challenges. It demands that models (1) utilize the contexts with deeper understanding, rather than resorting to simple copying and pasting, and (2) navigate through long streams of evolving topics and tasks, closely approximating the complexities of real-world usage of long-context LMs. Additionally, Task Haystack inherits the controllability of NIAH, providing model developers with tools and visualizations to identify model vulnerabilities effectively. We benchmark 12 long-context LMs using Task Haystack. We find that state-of-the-art closed models such as GPT-4o still struggle in this setting, failing 15% of the cases on average, while all open-weight models we evaluate lag behind by a large margin, failing up to 61% of the cases. In our controlled analysis, we identify factors such as distraction and recency bias as contributors to these failure cases. Further, we observe declines in performance when task instructions are paraphrased at test time or when ICL demonstrations are repeated excessively, raising concerns about the robustness, instruction understanding, and true context utilization of current long-context LMs.
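To make the setting concrete, the sketch below shows one way a Lifelong ICL prompt could be assembled and contrasted with the Single-task ICL baseline. This is a minimal illustration under stated assumptions: the Task dataclass, the build_lifelong_prompt helper, and the "Input:/Output:" prompt template are hypothetical and not the paper's exact implementation.

```python
# Minimal sketch of the Lifelong ICL setup (illustrative, not the paper's code).
from dataclasses import dataclass

@dataclass
class Task:
    instruction: str              # e.g., "Classify the sentiment of the review."
    demos: list[tuple[str, str]]  # (input, label) ICL demonstrations

def build_lifelong_prompt(tasks: list[Task], test_task: Task, test_input: str) -> str:
    """Concatenate demonstrations from a stream of tasks (the "haystack"),
    then append the test task's instruction and the test input. The model
    must locate the relevant demonstrations among those of other tasks."""
    blocks = []
    for task in tasks:  # tasks appear sequentially, as in Lifelong ICL
        lines = [task.instruction]
        lines += [f"Input: {x}\nOutput: {y}" for x, y in task.demos]
        blocks.append("\n".join(lines))
    # Test-time query: instruction and input only; the label must be inferred
    # from the in-context demonstrations seen earlier in the prompt.
    query = f"{test_task.instruction}\nInput: {test_input}\nOutput:"
    return "\n\n".join(blocks + [query])

# Lifelong ICL: the test task's demos are buried in a long stream of tasks.
# lifelong_prompt = build_lifelong_prompt(task_stream, test_task, test_input)
# Single-task ICL baseline: only the test task's own demonstrations in context.
# baseline_prompt = build_lifelong_prompt([test_task], test_task, test_input)
```

Under this framing, a model "passes" a case when its accuracy on the Lifelong ICL prompt is not significantly worse than on the Single-task ICL baseline prompt for the same task.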