We investigate the challenge of applying Large Language Models (LLMs) to long texts. We propose a theoretical framework that separates the failure modes of long-context tasks into three categories: cross-chunk dependence (task noise), confusion that grows with context length (model noise), and imperfect integration of partial results (aggregator noise). Under this view, we analyze when it is effective to use multi-agent chunking, i.e., dividing a long sequence into smaller chunks and aggregating the results of processing each chunk. Our experiments on retrieval, question answering, and summarization tasks confirm the theoretical analysis and identify the conditions that favor multi-agent chunking. By examining the accelerated decay of model fidelity with input length, we also explain why, for large inputs, a weaker model with chunk-based processing can surpass a more advanced model such as GPT-4o applied in a single pass. Overall, we present a principled framework, and our results highlight a direct path to handling long contexts in LLMs through carefully managed chunking and aggregation strategies.
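The chunk-and-aggregate pattern described above can be sketched as a simple map-reduce over a document. This is an illustrative toy, not the paper's implementation: `process_chunk` is a stand-in for a per-chunk LLM agent (here, a trivial retrieval filter), and the function names `chunk_lines`, `process_chunk`, `aggregate`, and `multi_agent_retrieve` are hypothetical.

```python
def chunk_lines(lines: list[str], chunk_size: int) -> list[list[str]]:
    """Divide a document (as a list of lines) into consecutive chunks
    of at most chunk_size lines each."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def process_chunk(chunk: list[str], query: str) -> list[str]:
    """Stand-in for an LLM agent processing one chunk: return the lines
    of the chunk that mention the query (a toy retrieval task)."""
    return [line for line in chunk if query in line]


def aggregate(partials: list[list[str]]) -> list[str]:
    """Aggregator: merge partial results across chunks, dropping
    duplicates while preserving order. An imperfect aggregator here
    would correspond to the paper's 'aggregator noise'."""
    seen: set[str] = set()
    merged: list[str] = []
    for part in partials:
        for item in part:
            if item not in seen:
                seen.add(item)
                merged.append(item)
    return merged


def multi_agent_retrieve(lines: list[str], query: str,
                         chunk_size: int = 4) -> list[str]:
    """Multi-agent chunking pipeline: split, process each chunk
    independently, then aggregate the partial results."""
    chunks = chunk_lines(lines, chunk_size)
    return aggregate([process_chunk(c, query) for c in chunks])
```

Note that chunking by whole lines keeps each retrievable unit inside a single chunk; if an answer spanned a chunk boundary, no per-chunk agent could see it in full, which is exactly the cross-chunk dependence (task noise) the framework identifies.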