Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain. Given a research question, an initial model-generated answer, and a set of relevant papers, an expert annotator iteratively issues instructions for the model to revise and improve its answer. We collect 1,260 interaction turns from 234 interaction sessions with three state-of-the-art LLMs. Each turn includes a user instruction, a model response, and a human evaluation of the model response. Through a detailed analysis of the collected responses, we find that all models struggle to incorporate new information into an existing answer and to perform precise and unambiguous edits. Further, we find that models struggle to judge whether their outputs successfully followed user instructions, with accuracy at least 10 points short of human agreement. Our findings indicate that KIWI will be a valuable resource to measure progress and improve LLMs' instruction-following capabilities for knowledge-intensive writing tasks.