Make Your LLM Fully Utilize the Context

While many contemporary large language models (LLMs) can process lengthy inputs, they still struggle to fully utilize information within a long context, a problem known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness of a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration of, and reasoning over, information from two or more short segments. By applying this information-intensive training to Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B to utilize long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves performance on real-world long-context tasks (e.g., 23.5->26.9 F1 score on NarrativeQA), while maintaining comparable performance on short-context tasks (e.g., 59.3->59.2 accuracy on MMLU). GitHub link: https://github.com/microsoft/FILM.
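
To make the data construction described above more concrete, here is a minimal sketch of how short key segments might be placed at random positions inside a longer context. It is an illustration under stated assumptions, not the authors' pipeline: the function name `build_context`, its parameters, and the use of whitespace-separated words as a stand-in for tokens are all hypothetical, and question-answer generation is left out entirely.

```python
import random


def build_context(key_segments, filler_segments, target_tokens, rng):
    """Scatter key segments at random slots among filler segments.

    Hypothetical helper: token counts are approximated by whitespace-split
    words, whereas a real pipeline would use the model's tokenizer.
    Returns the assembled context and the sorted slot index of each key segment.
    """
    # Reserve room for the key segments, then add fillers until the budget is spent.
    budget = target_tokens - sum(len(s.split()) for s in key_segments)
    fillers = []
    for seg in filler_segments:
        n = len(seg.split())
        if n > budget:
            break
        fillers.append(seg)
        budget -= n

    # Draw distinct slots uniformly, so a key segment is as likely to land
    # in the middle of the context as at its beginning or end.
    total = len(fillers) + len(key_segments)
    key_slots = sorted(rng.sample(range(total), len(key_segments)))
    slot_set = set(key_slots)

    keys = iter(key_segments)
    fill = iter(fillers)
    segments = [next(keys) if i in slot_set else next(fill) for i in range(total)]
    return " ".join(segments), key_slots


if __name__ == "__main__":
    rng = random.Random(0)
    fillers = [" ".join(["pad"] * 128) for _ in range(100)]
    keys = [" ".join(["keyA"] * 128), " ".join(["keyB"] * 128)]
    context, slots = build_context(keys, fillers, target_tokens=4096, rng=rng)
    print(len(context.split()), slots)
```

Passing one key segment corresponds to the single-segment case (1), and passing two or more corresponds to the multi-segment case (2). Sampling slots uniformly is one simple way to reflect the idea that any position in the context can hold crucial information; the actual position distribution used for training is not specified in this abstract.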
