Abstractive summarization has made significant strides in condensing and rephrasing large volumes of text into coherent summaries. However, summarizing administrative documents presents unique challenges due to domain-specific terminology, OCR-induced errors, and the scarcity of annotated datasets for model fine-tuning. Existing models often struggle to adapt to the intricate structure and specialized content of such documents. To address these limitations, we introduce DocSum, a domain-adaptive abstractive summarization framework tailored to administrative documents. By pre-training on OCR-transcribed text and fine-tuning with an innovative integration of question-answer pairs, DocSum enhances summary accuracy and relevance. This approach tackles the complexities inherent in administrative content, ensuring outputs that align with real-world business needs. To evaluate its capabilities, we define a novel downstream task setting, Document Abstractive Summarization, which reflects the practical requirements of business and organizational settings. Comprehensive experiments demonstrate DocSum's effectiveness in producing high-quality summaries, showcasing its potential to improve decision-making and operational workflows across the public and private sectors.