Retrieval-Augmented Generation (RAG) supplements a language model's input with retrieved documents, yet most RAG pipelines inherit retrieval components designed for human readers. How retrieved content should be represented when the consumer is a large language model (LLM) rather than a human is less well understood. Recent work has proposed transformations of retrieved content and identified properties that affect generation, but each examines a single transformation or property in isolation, leaving open which features of a document's representation matter most. We address this with a controlled comparison: holding retrieval fixed, we vary only the representation of retrieved documents, comparing an original baseline against thirteen transformations spanning selection, summarisation, and reformulation, in query-dependent and query-independent variants. Across these fourteen representations we measure question-answering accuracy for four generators, and for each representation we also measure answer retention: whether a known answer-bearing document still supports its answer after transformation. We find that answer retention is the primary determinant of generator accuracy; notably, when retention is high, a representation's wording, structure, length, and query-dependence have limited effect. This suggests that accuracy gains attributed to specific mechanisms in prior work may be partly explained by how well those mechanisms preserve answer-bearing content, an attribution that cannot be settled without controlling for retention.