The Needle-in-a-Haystack (NIAH) test is a general task used to assess language models' (LMs') abilities to recall particular information from a long input context. This framework, however, does not provide a means of analyzing what factors, beyond context length, contribute to LMs' abilities or inabilities to separate and recall needles from their haystacks. To provide a systematic means of assessing which features contribute to LMs' NIAH capabilities, we developed a synthetic benchmark called DENIAHL (Data-oriented Evaluation of NIAH for LLMs). Our work expands on previous NIAH studies by ablating NIAH features beyond the typical context length, including data type, size, and pattern. We find stark differences between GPT-3.5's and LLaMA 2-7B's performance on DENIAHL, drops in recall performance when features like item size are increased, and, to some degree, drops when the data type is changed from numbers to letters. This has implications for increasingly large-context models, demonstrating that factors beyond item count impact NIAH capabilities.