Long-context question answering (QA) over lengthy documents is critical for applications such as financial analysis, legal review, and scientific research. Current approaches, such as processing entire documents via a single LLM call or retrieving relevant chunks via retrieval-augmented generation (RAG), have two drawbacks. First, as context size increases, response quality can degrade, reducing accuracy. Second, iteratively processing hundreds of input documents can incur prohibitively high API costs. To improve response quality and reduce the number of iterations needed to get the desired response, users tend to add domain knowledge to their prompts. However, existing systems fail to systematically capture and use this knowledge to guide query processing. Domain knowledge is treated as prompt tokens alongside the document: the LLM may or may not follow it, there is no reduction in computational cost, and when outputs are incorrect, users must manually iterate. We present Halo, a long-context QA framework that automatically extracts domain knowledge from user prompts and applies it as executable operators across a multi-stage query execution pipeline. Halo identifies three common forms of domain knowledge (where in the document to look, what content to ignore, and how to verify the answer) and applies each at the pipeline stage where it is most effective: pruning the document before chunk selection, filtering irrelevant chunks before inference, and ranking candidate responses after generation. To handle imprecise or invalid domain knowledge, Halo includes a fallback mechanism that detects low-quality operators at runtime and selectively disables them. Our evaluation across finance, literature, and scientific datasets shows that Halo achieves up to 13% higher accuracy and 4.8x lower cost compared to baselines, and enables a lightweight open-source model to approach frontier LLM accuracy at 78x lower cost.
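The three-stage pipeline with fallback described in the abstract can be illustrated with a minimal sketch. This is not Halo's implementation: the `Operator` class, the `run_with_fallback` and `answer` functions, the toy document, and the "operator leaves nothing behind" test for a low-quality operator are all hypothetical choices made for illustration, and the `generate` function is a stand-in for an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Operator:
    """Hypothetical wrapper for one piece of extracted domain knowledge."""
    name: str
    apply: Callable[[List[str]], List[str]]
    enabled: bool = True


def run_with_fallback(op: Operator, items: List[str], min_keep: int = 1) -> List[str]:
    """Apply op; if it keeps fewer than min_keep items, disable it and pass input through.

    The min_keep heuristic is an assumption; the abstract does not say how
    low-quality operators are detected.
    """
    if not op.enabled:
        return items
    kept = op.apply(items)
    if len(kept) < min_keep:
        op.enabled = False  # selectively disable the operator
        return items
    return kept


def answer(chunks, prune, ignore, generate, verify):
    """Prune, then filter, then generate candidates, then rank them by verify score."""
    chunks = run_with_fallback(prune, chunks)   # where in the document to look
    chunks = run_with_fallback(ignore, chunks)  # what content to ignore
    candidates = generate(chunks)               # stand-in for LLM inference
    return max(candidates, key=verify)          # how to verify the answer


doc = [
    "Cover page: Annual Report 2023",
    "Income statement: total revenue was 120 million USD",
    "Legal disclaimer: forward-looking statements may differ",
    "Income statement: operating cost was 80 million USD",
]

prune = Operator("where_to_look", lambda xs: [x for x in xs if "Income statement" in x])
ignore = Operator("what_to_ignore", lambda xs: [x for x in xs if "disclaimer" not in x.lower()])
# An invalid hint: no chunk mentions a balance sheet, so it would drop everything.
bad_prune = Operator("bad_hint", lambda xs: [x for x in xs if "Balance sheet" in x])


def generate(chunks: List[str]) -> List[str]:
    # Toy generator: one candidate answer per chunk (text after the label).
    return [c.split(": ", 1)[1] for c in chunks]


def verify(candidate: str) -> float:
    # Toy verification rule: a valid answer should mention revenue.
    return 1.0 if "revenue" in candidate else 0.0


result = answer(doc, prune, ignore, generate, verify)
fallback_result = answer(doc, bad_prune, ignore, generate, verify)
```

In this toy run, the valid pruning operator shrinks the input before the stand-in inference step, while the invalid one is disabled at runtime and the pipeline proceeds on the unpruned chunks. A real system would presumably need a richer quality signal than an empty result.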