Although many contemporary large language models (LLMs) can process lengthy inputs, they still struggle to fully utilize information within a long context, a problem known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, we present information-intensive (IN2) training, a purely data-driven solution to lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset in which the answer requires (1) fine-grained awareness of information in a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration of, and reasoning over, information from two or more short segments. By applying this information-intensive training to Mistral-7B, we obtain FILM-7B (FILl-in-the-Middle). To thoroughly assess FILM-7B's ability to utilize long contexts, we design three probing tasks that cover various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA) while maintaining comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). GitHub link: https://github.com/microsoft/FILM.
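The data construction behind point (1) can be illustrated with a minimal sketch. It is not the authors' pipeline, which the abstract does not specify: the function names (`count_tokens`, `build_long_context`, `make_example`) are hypothetical, whitespace-separated words stand in for real tokenizer tokens, and the multi-segment reasoning of point (2) is not covered. The sketch only shows the core idea of placing one short key segment at a uniformly random position inside a long context assembled from filler segments.

```python
import random


def count_tokens(text):
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def build_long_context(key_segment, filler_segments, target_tokens, rng):
    """Surround one key segment with shuffled filler segments.

    Returns (segments, index), where index is the key segment's position.
    """
    pool = list(filler_segments)
    rng.shuffle(pool)
    chosen = []
    total = count_tokens(key_segment)
    for segment in pool:
        if total >= target_tokens:
            break
        chosen.append(segment)
        total += count_tokens(segment)
    # Uniform placement: every position is equally likely to hold the answer.
    index = rng.randint(0, len(chosen))
    chosen.insert(index, key_segment)
    return chosen, index


def make_example(key_segment, question, answer, filler_segments, rng,
                 min_tokens=4000, max_tokens=32000):
    """Build one hypothetical long-context QA training example."""
    target = rng.randint(min_tokens, max_tokens)
    segments, index = build_long_context(
        key_segment, filler_segments, target, rng
    )
    return {
        "context": "\n".join(segments),
        "question": question,
        "answer": answer,
        "key_index": index,
        "num_segments": len(segments),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical filler: 300 segments of 128 pseudo-tokens each.
    fillers = [" ".join(f"w{i}_{j}" for j in range(128)) for i in range(300)]
    example = make_example(
        "The secret code is 7421.", "What is the secret code?", "7421",
        fillers, rng,
    )
    print(example["key_index"], example["num_segments"],
          count_tokens(example["context"]))
```

Because the key segment's index is drawn uniformly, answers over many generated examples are spread across the whole context rather than clustered at the start or end, which is the kind of supervision the abstract argues is missing from standard long-context training.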