Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a reader outside the source document. Yet users may find such snippets difficult to understand because the snippets lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents so that they can be read on their own. First, we define the requirements and challenges for this user-facing decontextualization task, such as clarifying where edits occur and handling references to other documents. Second, we propose a framework that decomposes the task into three stages: question generation, question answering, and rewriting. Using this framework, we collect gold decontextualizations from experienced scientific article readers. We then conduct a range of experiments across state-of-the-art commercial and open-source language models to identify how best to provide missing-but-relevant information to models for our task. Finally, we develop QaDecontext, a simple prompting strategy inspired by our framework that improves over end-to-end prompting. We conclude with an analysis finding that, while rewriting is easy, question generation and question answering remain challenging for today's models.
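To make the three-stage decomposition concrete, the following is a minimal sketch of how question generation, question answering, and rewriting could be chained. It is an illustration under stated assumptions, not the QaDecontext implementation: the `LanguageModel` callable, the function names, the prompt wording, and the square-bracket convention for marking edits are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt string to a completion string.
LanguageModel = Callable[[str], str]


@dataclass
class QAPair:
    question: str
    answer: str


def generate_questions(lm: LanguageModel, snippet: str) -> List[str]:
    """Stage 1: ask what a reader would need clarified to read the snippet alone."""
    prompt = (
        "TASK: QUESTIONS\n"  # assumed prompt wording, for illustration only
        "List, one per line, the questions a reader would need answered "
        "to understand this snippet without its source document.\n"
        f"Snippet: {snippet}"
    )
    # One question per non-empty line of the model output.
    return [line.strip() for line in lm(prompt).splitlines() if line.strip()]


def answer_questions(lm: LanguageModel, questions: List[str], context: str) -> List[QAPair]:
    """Stage 2: answer each question using text from the source document."""
    pairs = []
    for question in questions:
        prompt = (
            "TASK: ANSWER\n"
            "Answer the question briefly using only the context.\n"
            f"Context: {context}\n"
            f"Question: {question}"
        )
        pairs.append(QAPair(question=question, answer=lm(prompt).strip()))
    return pairs


def rewrite_snippet(lm: LanguageModel, snippet: str, qa_pairs: List[QAPair]) -> str:
    """Stage 3: fold the answers into the snippet, marking inserted text."""
    facts = "\n".join(f"Q: {p.question}\nA: {p.answer}" for p in qa_pairs)
    prompt = (
        "TASK: REWRITE\n"
        "Rewrite the snippet so it can be read on its own. Put any inserted "
        "text in [square brackets] so readers can see where edits occur.\n"  # assumed convention
        f"{facts}\n"
        f"Snippet: {snippet}"
    )
    return lm(prompt).strip()


def decontextualize(lm: LanguageModel, snippet: str, context: str) -> str:
    """Run the three stages in sequence."""
    questions = generate_questions(lm, snippet)
    qa_pairs = answer_questions(lm, questions, context)
    return rewrite_snippet(lm, snippet, qa_pairs)
```

Keeping each stage as a separate function lets the stages be evaluated in isolation, which matches the observation that rewriting is comparatively easy while question generation and answering are the harder steps. In practice, `lm` would wrap a call to a commercial or open-source model; any callable with the same signature can be substituted.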