This work addresses extractive text summarization when labeled data is limited, using a semi-supervised approach. Specifically, we propose a prompt-based pseudolabel selection strategy using GPT-4. We evaluate our method on three text summarization datasets: TweetSumm, WikiHow, and ArXiv/PubMed. Our experiments show that using an LLM to evaluate and generate pseudolabels improves ROUGE-1 by 10-20% across the datasets, a gain akin to that of enhancing pretrained models. We also show that the method needs a smaller pool of unlabeled examples to perform better.
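As a rough illustration of the two ideas named above, the following Python sketch shows a simplified ROUGE-1 F1 computation and a generic score-then-select step for pseudolabels. It is a sketch under stated assumptions, not the method evaluated here: the abstract does not specify the prompt, the rating scale, or the selection rule, so `select_pseudolabels`, `score_fn`, `top_k`, and `toy_score` are hypothetical names, and `toy_score` is a word-coverage placeholder standing in for a GPT-4 rating. The ROUGE-1 function uses lowercased whitespace tokens with no stemming, so its values will differ from those of standard ROUGE toolkits.

```python
from collections import Counter
from typing import Callable, List, Tuple

# (document, candidate extractive summary); a hypothetical container type.
Pseudolabel = Tuple[str, str]


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def select_pseudolabels(
    candidates: List[Pseudolabel],
    score_fn: Callable[[str, str], float],
    top_k: int,
) -> List[Pseudolabel]:
    """Keep the top_k candidates ranked by score_fn (highest first).

    In an LLM-based setup, score_fn would wrap a prompt asking the model to
    rate the summary; that wrapper is an assumption and is not shown here.
    """
    scored = [(score_fn(doc, summ), doc, summ) for doc, summ in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(doc, summ) for _, doc, summ in scored[:top_k]]


def toy_score(document: str, summary: str) -> float:
    """Placeholder scorer: fraction of distinct document words in the summary."""
    doc_words = set(document.lower().split())
    if not doc_words:
        return 0.0
    return len(doc_words & set(summary.lower().split())) / len(doc_words)


if __name__ == "__main__":
    pool = [
        ("a b c d", "a"),
        ("a b c d", "a b c"),
        ("x y", "z"),
    ]
    print(select_pseudolabels(pool, toy_score, top_k=2))
    print(rouge1_f1("the cat", "the cat sat"))
```

Passing the scorer as a callable keeps the selection step independent of any particular model or API, so the placeholder can be swapped for a real rating function without changing the ranking logic.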