We argue that long context understanding comprises two major distinct capabilities: retrieval and holistic understanding. Understanding and further improving LLMs' long context capabilities is not possible without knowing which of these a task targets. We aim to automatically identify retrieval-focused and holistic-understanding-focused problems across suites of benchmarks and to quantitatively measure the difficulty within each focus. In this paper, we present the Dolce framework, which parameterizes each problem by $\lambda$ (complexity) and $k$ (redundancy) and assigns it to one of five predefined focus categories. We propose sampling short contexts from the full context and estimating the probability that an LLM solves the problem using the sampled spans. To find the $\lambda$ and $k$ for each problem, we further propose a mixture model with a non-parametric background-noise component and a parametric/non-parametric hybrid oracle component, for which we derive the probability functions parameterized by $\lambda$ and $k$ under both the correct-or-wrong (COW) scenario and the partial-point-in-grading (PIG) scenario. Across 44 existing long context evaluation tasks, our methods identify 0% to 67% of the problems as retrieval focused and 0% to 90% of the problems as holistic understanding focused.
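To build intuition for how a complexity/redundancy parameterization can separate the two focuses, the following is a minimal illustrative sketch. The functional form below is an assumption for exposition only (it is not the paper's derived COW/PIG probability function): the oracle succeeds when a sampled span hits at least one of $k$ redundant evidence pieces, attenuated by a complexity exponent $\lambda$, on top of a flat background-noise floor.

```python
def solve_prob(f, lam, k, eps=0.05):
    """Illustrative probability that an LLM solves a problem given a
    span covering fraction f of the full context.

    Assumed toy form (NOT the paper's derived functions):
      - coverage: chance the span contains one of k redundant
        evidence pieces, modeled as 1 - (1 - f)^k
      - complexity: exponent lam penalizes harder problems
      - eps: flat background-noise (guessing) component
    """
    coverage = 1.0 - (1.0 - f) ** k
    oracle = coverage ** lam
    return eps + (1.0 - eps) * oracle

# Retrieval-focused toy problem: redundant evidence (k=8), simple (lam=1):
# short spans already suffice.
retrieval = solve_prob(0.2, lam=1, k=8)

# Holistic toy problem: one global property (k=1), complex (lam=4):
# short spans rarely suffice, so success stays near the noise floor.
holistic = solve_prob(0.2, lam=4, k=1)
```

Under this toy model, `retrieval` is high (about 0.84) while `holistic` stays near the noise floor (about 0.05) at the same 20% span, which mirrors how fitted $(\lambda, k)$ values could be thresholded into focus categories.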