In long-document controllable summarization, where labeled data is scarce, pretrained models struggle to adapt to the task and to respond effectively to user queries. In this paper, we introduce Socratic pretraining, a question-driven, unsupervised pretraining objective specifically designed to improve controllability in summarization tasks. By training a model to generate and answer relevant questions in a given context, Socratic pretraining enables the model to adhere more closely to user-provided queries and to identify the relevant content to be summarized. We demonstrate the effectiveness of this approach through extensive experiments on two summarization domains, short stories and dialogue, and multiple control strategies: keywords, questions, and factoid QA pairs. Our pretraining method relies only on unlabeled documents and a question generation system, and it outperforms pre-finetuning approaches that use additional supervised data. Furthermore, our results show that Socratic pretraining cuts task-specific labeled data requirements in half, is more faithful to user-provided queries, and achieves state-of-the-art performance on QMSum and SQuALITY.
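To make the question-driven objective described in the abstract more concrete, the following is a minimal sketch of how one unsupervised training pair might be built from an unlabeled document. The abstract does not specify the construction, so everything here is an assumption made for illustration: the function names, the `<mask>` and `<qsep>` tokens, the choice to mask the selected sentences in the source, and the toy question generator, which stands in for a real question generation system.

```python
from typing import Callable, List, Tuple


def build_socratic_example(
    sentences: List[str],
    target_idx: List[int],
    generate_question: Callable[[str], str],
    mask_token: str = "<mask>",   # hypothetical placeholder token
    qsep: str = "<qsep>",         # hypothetical question/answer separator
) -> Tuple[str, str]:
    """Return a (source, target) pair for one question-driven pretraining example."""
    targets = sorted(set(target_idx))
    # Source: the document with the selected sentences hidden.
    source = " ".join(
        mask_token if i in targets else s for i, s in enumerate(sentences)
    )
    # Target: first a question about each hidden sentence, then the hidden
    # sentences themselves, so the model learns to ask and then answer.
    questions = [generate_question(sentences[i]) for i in targets]
    answers = [sentences[i] for i in targets]
    target = " ".join(questions) + f" {qsep} " + " ".join(answers)
    return source, target


def toy_question_generator(sentence: str) -> str:
    """Stand-in for a real question generation system (assumption)."""
    return f"What does the text say about '{sentence.split()[0]}'?"


if __name__ == "__main__":
    doc = [
        "Alice opened the meeting.",
        "Bob proposed a budget.",
        "The team agreed.",
    ]
    src, tgt = build_socratic_example(doc, [1], toy_question_generator)
    print(src)
    print(tgt)
```

Because the pair is derived from the document and a question generator alone, no human-written summaries or queries are needed, which matches the abstract's claim that the method relies only on unlabeled documents and a question generation system.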