Generating high-quality summaries of chat dialogs often requires large labeled datasets. We propose a method that makes efficient use of unlabeled data for extractive summarization of customer-agent dialogs. We frame summarization as a question-answering problem and use state-of-the-art large language models (LLMs) to generate pseudo-labels for a dialog. We then use these pseudo-labels to fine-tune a chat summarization model, effectively transferring knowledge from the LLM into a smaller specialized model. We demonstrate our method on the TweetSumm dataset and show that, using 10% of the original labeled dataset, we achieve 65.9/57.0/61.0 ROUGE-1/-2/-L, whereas the current state of the art, trained on the entire training set, obtains 65.16/55.81/64.37 ROUGE-1/-2/-L. In other words, in the worst case (ROUGE-L) we still retain 94.7% of the performance while using only 10% of the labeled data.
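The retention figure can be checked with a short calculation. The sketch below is illustrative only: it uses the rounded ROUGE scores quoted in the abstract, and the names `ours`, `sota`, and `retention` are hypothetical, not taken from the authors' code. With these rounded inputs the ROUGE-L ratio comes out at roughly 94.8%, close to the 94.7% stated; the small gap presumably reflects rounding of the published scores, which is an assumption rather than something the abstract confirms.

```python
# ROUGE scores as quoted in the abstract (rounded values, not raw results).
ours = {"ROUGE-1": 65.9, "ROUGE-2": 57.0, "ROUGE-L": 61.0}     # 10% of labeled data
sota = {"ROUGE-1": 65.16, "ROUGE-2": 55.81, "ROUGE-L": 64.37}  # full training set


def retention(scores, baseline):
    """Return each metric's score as a percentage of the baseline score."""
    return {metric: 100.0 * scores[metric] / baseline[metric] for metric in scores}


ratios = retention(ours, sota)

# The worst case is the metric with the lowest retained percentage.
worst_metric = min(ratios, key=ratios.get)

for metric, ratio in ratios.items():
    print(f"{metric}: {ratio:.2f}% of the full-data baseline")
print(f"worst case: {worst_metric}")
```

On ROUGE-1 and ROUGE-2 the ratio exceeds 100%, so ROUGE-L is the only metric on which the 10% setting trails the full-data baseline.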