In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information ("margins") that guides the model towards specific tasks. This method increases computational overhead only marginally while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and more than a 30.0% increase in the F1-score for aggregation tasks (CWE). Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing, and pinpoints the integration of relevant information into the final response. We release our implementation of WiM using the Hugging Face Transformers library at https://github.com/writer/writing-in-the-margins.
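To make the pattern concrete, the following is a minimal, self-contained sketch of the WiM control flow described above. It is not the authors' implementation: the segment splitter, `write_margin`, and `classify_margin` functions are hypothetical stand-ins for the actual LLM calls, which in practice would reuse the key-value cache built during chunked prefill.

```python
# Sketch of the Writing in the Margins (WiM) inference pattern.
# All model calls are stubbed with toy string functions; in the real
# pattern, each segment is prefilled into the KV cache and a "margin"
# is generated from that partial cache before prefill continues.

def split_into_segments(context: str, size: int) -> list[str]:
    """Split the long context into fixed-size chunks (toy splitter)."""
    return [context[i:i + size] for i in range(0, len(context), size)]

def write_margin(segment: str, query: str) -> str:
    """Stand-in for the LLM call that summarizes a segment w.r.t. the query."""
    return f"note: segment mentions {query!r}" if query in segment else "irrelevant"

def classify_margin(margin: str) -> bool:
    """Stand-in for the classification step that keeps only relevant margins."""
    return margin != "irrelevant"

def wim_answer(context: str, query: str, segment_size: int = 32) -> str:
    margins = []
    for segment in split_into_segments(context, segment_size):
        margin = write_margin(segment, query)  # generated during chunked prefill
        if classify_margin(margin):            # filtered before final generation
            margins.append(margin)
    # Final generation conditions on the query plus the retained margins.
    return f"answer({query}; margins={margins})"
```

Because margins are produced as each segment is prefilled, this loop is also where an interactive design would stream progress updates to the end-user.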