We present PeerSum, a novel dataset for generating meta-reviews of scientific papers. The meta-reviews can be interpreted as abstractive summaries of reviews, multi-turn discussions and the paper abstract. These source documents have rich inter-document relationships with an explicit hierarchical conversational structure, cross-references and (occasionally) conflicting information. To inject this structural inductive bias into pre-trained language models, we introduce Rammer (Relationship-aware Multi-task Meta-review Generator), a model that uses sparse attention based on the conversational structure and a multi-task training objective that predicts metadata features (e.g., review ratings). Our experimental results show that Rammer outperforms other strong baseline models across a suite of automatic evaluation metrics. Further analyses, however, reveal that Rammer and other models struggle to handle conflicts in the source documents of PeerSum, suggesting that meta-review generation is a challenging task and a promising avenue for further research.
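To make the idea of sparse attention based on the conversational structure more concrete, the following is a minimal sketch of how a document-level attention mask could be derived from a reply tree. It is an illustration under stated assumptions, not Rammer's actual implementation: the function name `build_structure_mask`, the `parents` encoding, and the connectivity rule (each document attends only to itself, the document it replies to, and its direct replies) are all hypothetical, since the abstract does not specify which relationships the model uses.

```python
from typing import List, Optional


def build_structure_mask(parents: List[Optional[int]]) -> List[List[bool]]:
    """Build a document-level attention mask from a reply tree.

    parents[i] is the index of the document that document i replies to,
    or None if document i is the root of the thread.
    mask[i][j] is True when document i is allowed to attend to document j.

    Assumed connectivity rule (hypothetical, for illustration only):
    self, parent, and direct replies.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, p in enumerate(parents):
        mask[i][i] = True      # every document attends to itself
        if p is not None:
            mask[i][p] = True  # a reply attends to the document it answers
            mask[p][i] = True  # and that document attends to its reply
    return mask


# Hypothetical thread: 0 = paper abstract, 1 and 2 = official reviews,
# 3 = author response to review 1.
parents = [None, 0, 0, 1]
mask = build_structure_mask(parents)

for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

In this toy thread the author response (document 3) can attend to review 1 but not to review 2, so attention follows the conversation rather than connecting every pair of documents. A real model would expand such a mask from documents to tokens before applying it inside the attention layers.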