Hallucination in text summarization occurs when a model generates information that is not supported by the source document. It poses a significant obstacle to the accuracy and reliability of generated summaries. In this paper, we aim to reduce hallucinations in summaries of long-form text documents. We use the PubMed dataset, which contains long scientific research papers and their abstracts. We incorporate data filtering and joint entity and summary generation (JAENS) into the fine-tuning of the Longformer Encoder-Decoder (LED) model to minimize hallucinations and thereby improve the quality of the generated summaries. We measure factual consistency at the entity level with the following metrics: precision-source and F1-target. Our experiments show that the fine-tuned LED model performs well at generating paper abstracts. Data filtering based on preprocessing steps reduces entity-level hallucinations in the generated summaries on some of the factual consistency metrics.
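The entity-level metrics and the filtering step named above can be illustrated with a minimal sketch. The abstract does not define them, so the definitions here are assumptions based on the usual entity-level formulation: precision-source is the fraction of named entities in the generated summary that also occur in the source document, F1-target is the harmonic mean of entity precision and recall against the reference abstract, and filtering drops training pairs whose reference abstract mentions an entity absent from the source. The function names, the case-insensitive substring matching, and the use of pre-extracted entity lists (instead of a real NER model) are all hypothetical simplifications, not the paper's implementation.

```python
from typing import Iterable, Set


def _normalize(entities: Iterable[str]) -> Set[str]:
    """Lowercase and strip entity strings, dropping empty ones (assumed matching rule)."""
    return {e.strip().lower() for e in entities if e.strip()}


def precision_source(summary_entities: Iterable[str], source_text: str) -> float:
    """Fraction of summary entities that appear somewhere in the source text."""
    ents = _normalize(summary_entities)
    if not ents:
        return 1.0  # assumption: a summary with no entities has no entity-level hallucination
    source = source_text.lower()
    return sum(e in source for e in ents) / len(ents)


def f1_target(summary_entities: Iterable[str], reference_entities: Iterable[str]) -> float:
    """Harmonic mean of entity precision and recall against the reference summary."""
    summ = _normalize(summary_entities)
    ref = _normalize(reference_entities)
    overlap = len(summ & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(summ)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def keep_training_pair(reference_entities: Iterable[str], source_text: str) -> bool:
    """Assumed filter: keep a pair only if every reference entity occurs in the source."""
    return precision_source(reference_entities, source_text) == 1.0


# Toy example with hand-written entity lists (hypothetical data).
source = "Aspirin reduced fever in 120 patients at Boston Hospital."
generated = ["Aspirin", "Boston Hospital", "Ibuprofen"]
reference = ["Aspirin", "Boston Hospital", "120"]

print(precision_source(generated, source))
print(f1_target(generated, reference))
print(keep_training_pair(reference, source))
```

In practice the entity lists would come from a named-entity recognizer run over the source, reference, and generated texts; exact-string matching as used here will miss paraphrased or abbreviated entities.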