While many contemporary large language models (LLMs) can process lengthy input, they still struggle to fully utilize information within the long context, a problem known as the lost-in-the-middle challenge. We hypothesize that it stems from insufficient explicit supervision during long-context training, which fails to emphasize that any position in a long context can hold crucial information. Based on this intuition, our study presents information-intensive (IN2) training, a purely data-driven solution to overcome lost-in-the-middle. Specifically, IN2 training leverages a synthesized long-context question-answer dataset, where the answer requires (1) fine-grained information awareness on a short segment (~128 tokens) within a synthesized long context (4K-32K tokens), and (2) the integration and reasoning of information from two or more short segments. By applying this information-intensive training to Mistral-7B, we present FILM-7B (FILl-in-the-Middle). To thoroughly assess the ability of FILM-7B to utilize long contexts, we design three probing tasks that encompass various context styles (document, code, and structured-data context) and information retrieval patterns (forward, backward, and bi-directional retrieval). The probing results demonstrate that FILM-7B can robustly retrieve information from different positions in its 32K context window. Beyond these probing tasks, FILM-7B significantly improves performance on real-world long-context tasks (e.g., from 23.5 to 26.9 F1 on NarrativeQA), while maintaining comparable performance on short-context tasks (e.g., from 59.3 to 59.2 accuracy on MMLU). GitHub link: https://github.com/microsoft/FILM.
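To make the data-construction idea concrete, the following is a minimal sketch of how one training context could be assembled: a short answer-bearing segment is dropped at a uniformly random slot among distractor segments until the context reaches a length in the 4K-32K range. This is an illustration under stated assumptions, not the released FILM pipeline: the function name `build_long_context`, the constants, and the use of pre-tokenized lists are all hypothetical, and only the segment size (~128 tokens) and context range (4K-32K tokens) come from the abstract.

```python
import random

# Sizes taken from the abstract; everything else here is an assumption.
SEGMENT_TOKENS = 128                        # approximate short-segment size
MIN_CONTEXT, MAX_CONTEXT = 4_096, 32_768    # target long-context range (tokens)


def build_long_context(needle, fillers, rng):
    """Place `needle` at a uniformly random slot among filler segments.

    needle:  list of tokens for the segment that holds the answer
    fillers: list of token lists used as distractor segments
             (assumed to be about SEGMENT_TOKENS tokens each)
    rng:     a random.Random instance, passed in for reproducibility
    Returns (context_tokens, needle_start_offset).
    """
    # Sample a target length so that contexts cover the whole range.
    target_len = rng.randint(MIN_CONTEXT, MAX_CONTEXT)
    n_fillers = max(0, (target_len - len(needle)) // SEGMENT_TOKENS)
    chosen = [rng.choice(fillers) for _ in range(n_fillers)]

    # Any slot is allowed, including the very start and the very end,
    # so that no position is systematically favored.
    slot = rng.randint(0, len(chosen))
    segments = chosen[:slot] + [needle] + chosen[slot:]

    offset = sum(len(seg) for seg in chosen[:slot])
    context = [tok for seg in segments for tok in seg]
    return context, offset
```

A question-answer pair written about the needle segment would then be paired with the returned context. The second requirement in the abstract, reasoning over two or more segments, could be approximated by inserting several needles at independently sampled slots; that extension is not shown here.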