Retrieval-Augmented Generation (RAG) enhances factual grounding in large language models (LLMs) by incorporating retrieved evidence, but LLM accuracy declines when long or noisy contexts exceed the model's effective attention span. Existing pre-generation filters rely on heuristics or uncalibrated LLM confidence scores and offer no statistical control over the retained evidence. We evaluate context engineering via conformal prediction: a coverage-controlled filtering framework that removes irrelevant content while preserving recall of supporting evidence. Using both embedding- and LLM-based scoring functions, we test this approach on the NeuCLIR and RAGTIME collections. Conformal filtering consistently meets its target coverage, guaranteeing that a specified fraction of relevant snippets is retained, and reduces retained context by 2-3x relative to unfiltered retrieval. On NeuCLIR, downstream factual accuracy measured by ARGUE F1 improves under strict filtering and remains stable at moderate coverage, indicating that most discarded material is redundant or irrelevant. These results demonstrate that conformal prediction enables reliable, coverage-controlled context reduction in RAG, offering a model-agnostic, statistically principled approach to context engineering.
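The coverage guarantee described above can be sketched with standard split conformal prediction: calibrate a score threshold on a held-out set of known-relevant snippets so that, under exchangeability, at least the target fraction of relevant snippets clears it at test time. A minimal illustration under assumptions of our own (function names and the scoring setup are illustrative, not the paper's implementation):

```python
import math

def conformal_threshold(cal_scores, coverage=0.9):
    """Split-conformal threshold from relevance scores of calibration
    snippets known to be relevant. Keeping snippets with score >= tau
    retains at least `coverage` of relevant snippets (marginally, under
    exchangeability of calibration and test scores)."""
    n = len(cal_scores)
    alpha = 1.0 - coverage
    # Finite-sample correction: allow at most floor(alpha * (n + 1))
    # calibration scores to fall below the threshold.
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration points for this coverage level: keep everything.
        return float("-inf")
    return sorted(cal_scores)[k - 1]  # k-th smallest calibration score

def filter_snippets(snippets, scores, tau):
    """Coverage-controlled filter: keep only snippets scoring >= tau."""
    return [sn for sn, sc in zip(snippets, scores) if sc >= tau]

# Toy usage: 100 calibration scores, 90% target coverage.
cal = list(range(1, 101))
tau = conformal_threshold(cal, coverage=0.9)
kept = filter_snippets(["a", "b", "c"], [5.0, 10.0, 50.0], tau)
```

The same recipe works with any relevance scorer (embedding similarity or an LLM judge, as in the abstract); the guarantee depends only on the calibration scores being exchangeable with test-time scores, not on the scorer being calibrated.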