We present a study of retrieval-augmented language models (LMs) on long-form question answering. We analyze how retrieval augmentation impacts different LMs by comparing answers that different models generate from the same evidence documents, and how the quality of the retrieved document set impacts answers generated by the same LM. We study various attributes of the generated answers (e.g., fluency, length, variance), with an emphasis on the attribution of generated long-form answers to in-context evidence documents. We collect human annotations of answer attribution and evaluate methods for automatically judging attribution. We further identify attribution patterns for long text generation and analyze the main culprits of attribution errors. Together, our analysis provides new insights into how retrieval augmentation impacts long, knowledge-rich text generation by LMs and provides directions for future work.