Grounded text generation, encompassing tasks such as long-form question answering and summarization, requires both content selection and content consolidation. Current end-to-end methods are opaque, which makes them difficult to control and interpret. Accordingly, recent works have proposed a modular approach, with a separate component for each step. We focus on the second subtask: generating coherent text from pre-selected content in a multi-document setting. Concretely, we formalize \textit{Fusion-in-Context} (FiC) as a standalone task, whose input consists of source texts with highlighted spans of targeted content. A model must then generate a coherent passage that includes all and only the highlighted information. We contribute a curated dataset of 1000 instances in the reviews domain, alongside a novel evaluation framework for assessing the faithfulness and coverage of highlights, which correlates strongly with human judgment. Experiments with several baseline models show promising results and yield insightful analyses. This study lays the groundwork for further exploration of modular text generation in the multi-document setting, offering potential improvements in the quality and reliability of generated content.\footnote{Our benchmark, FuseReviews, including the dataset, evaluation framework, and designated leaderboard, can be found at \url{https://fusereviews.github.io/}.}