Automatic related work generation systems must ground their outputs in the content of the cited papers to avoid non-factual hallucinations, but because scientific documents are long, existing abstractive approaches have conditioned only on the \textit{abstracts} of the cited papers. We demonstrate that the abstract is not always the most appropriate input for citation generation and that models trained in this way learn to hallucinate. We propose to condition instead on the \textit{cited text span} (CTS). Because manual CTS annotation is extremely time- and labor-intensive, we experiment with automatic, ROUGE-based labeling of candidate CTS sentences and find that it performs well enough to substitute for expensive human annotations. We also propose a human-in-the-loop, keyword-based CTS retrieval approach that makes generating citation texts grounded in the full text of cited papers both promising and practical.
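The ROUGE-based labeling of candidate CTS sentences mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simplified unigram-overlap ROUGE-1 F1 score, a plain regex tokenizer, and a hypothetical `top_k` cutoff, and the function names `rouge1_f1` and `label_cts` are introduced here purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap between two strings."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())  # multiset intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def label_cts(citation_text, cited_sentences, top_k=2):
    """Return indices of the top_k cited-paper sentences most similar to the
    citation text, skipping sentences with no overlap at all.

    top_k is a hypothetical cutoff chosen for this sketch.
    """
    scored = [(rouge1_f1(s, citation_text), i) for i, s in enumerate(cited_sentences)]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in ranked[:top_k] if score > 0]


if __name__ == "__main__":
    citation = ("Smith et al. train a transformer model to improve "
                "summarization of news articles.")
    sentences = [
        "We train a transformer model on news articles.",
        "The weather was pleasant during the conference.",
        "Our model improves summarization quality on news articles.",
    ]
    print(label_cts(citation, sentences))
```

In this sketch, the sentences selected by `label_cts` would stand in for human CTS annotations as the conditioning input to a citation generator. A real pipeline would more likely use an established ROUGE implementation and a tuned selection threshold.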