Despite the prevalence of pretrained language models in natural language understanding tasks, understanding lengthy text such as documents remains challenging because of data sparseness. Inspired by the observation that humans develop their ability to understand lengthy text by reading shorter text, we propose SUMMaug, a simple yet effective summarization-based data augmentation method for document classification. We first obtain easy-to-learn examples for the target document classification task by summarizing the inputs of the original training examples, optionally merging the original labels so that they conform to the summarized inputs. We then use the generated pseudo examples to perform curriculum learning. Experimental results on two datasets confirm the advantage of our method over existing baseline methods in terms of robustness and accuracy. We release our code and data at https://github.com/etsurin/summaug.
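The two steps described above (building pseudo examples from summaries, then training with a curriculum) can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: `lead_summary` is a hypothetical stand-in for a real summarization model, the five-to-three label mapping `merge` is an invented example of label merging, and the easy-to-hard ordering (summaries first, original documents afterwards) is one plausible reading of the curriculum.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Example:
    text: str
    label: int


def lead_summary(text: str, max_sentences: int = 2) -> str:
    """Hypothetical placeholder summarizer: keep the first few sentences.

    A real setup would call a summarization model here instead.
    """
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return ""
    return ". ".join(sentences[:max_sentences]) + "."


def make_pseudo_examples(
    examples: List[Example],
    summarize: Callable[[str], str],
    label_map: Optional[Dict[int, int]] = None,
) -> List[Example]:
    """Summarize each input; optionally merge labels via label_map."""
    pseudo = []
    for ex in examples:
        label = ex.label if label_map is None else label_map[ex.label]
        pseudo.append(Example(summarize(ex.text), label))
    return pseudo


def curriculum(
    original: List[Example],
    pseudo: List[Example],
    easy_epochs: int,
    hard_epochs: int,
) -> Iterator[Tuple[str, List[Example]]]:
    """Yield (stage, examples) once per epoch: summaries first, then documents.

    The ordering is an assumption made for this sketch.
    """
    for _ in range(easy_epochs):
        yield "summary", pseudo
    for _ in range(hard_epochs):
        yield "original", original


# Toy data; the labels and the merge table are invented for illustration.
train = [
    Example("Great plot. Strong acting. The ending dragged a little.", 4),
    Example("Dull script. Flat characters. I left early.", 0),
]
merge = {0: 0, 1: 0, 2: 1, 3: 2, 4: 2}  # hypothetical 5-class to 3-class merge
pseudo = make_pseudo_examples(train, lead_summary, merge)
schedule = [stage for stage, _ in curriculum(train, pseudo, 1, 2)]
```

In an actual training loop, each `(stage, examples)` pair would be passed to one epoch of fine-tuning. If labels are merged for the summary stage, the classifier head has to handle the change in label space between stages; how that is done is left open in this sketch.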