Abstractive text summarization has grown alongside ever-larger training sets, driven by the data demands of deep learning models. These models tend to exploit the representations in the training data to attain superior performance on quantitative measures of the generated summary. However, increasing the size of the training set is not always the best way to maximize performance, so the quality of the training samples and the learning protocol of deep learning models need to be revisited. In this paper, we discretize the vector space of abstractive text summarization models to understand the characteristics learned between the input embedding space and the models' encoder space. We show that deep models fail to capture the diversity of the input space. Further, the distribution of data points in the encoder space indicates that an unchecked increase in training samples does not add value; rather, the training data needs to be pruned so that the models focus on variability and faithfulness. We employ clustering techniques to study the diversity of a model's sample space and how data points are mapped from the embedding space to the encoder space and vice versa. We also devise a metric to filter out redundant data points, making the model more robust and less data hungry. We benchmark the proposed method using quantitative metrics, such as ROUGE, and qualitative metrics, such as BERTScore, FEQA, and Pyramid score. Finally, we quantify the reasons that inhibit the models from learning diversity from varied input samples.
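
The abstract does not specify which clustering algorithm or redundancy metric is used, so the following is only a minimal sketch of the general idea of clustering sample vectors and discarding near-duplicates. It assumes greedy leader clustering under cosine similarity; the names `leader_cluster` and `redundancy_ratio`, the threshold of 0.95, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def leader_cluster(vectors, threshold=0.95):
    """Greedy leader clustering (an assumed stand-in, not the paper's method).

    Each vector joins the first existing cluster whose leader is at least
    `threshold` similar to it; otherwise it becomes a new leader.
    Returns (leader indices, cluster id for every vector).
    """
    leaders = []
    assignment = []
    for i, v in enumerate(vectors):
        for cluster_id, leader_idx in enumerate(leaders):
            if cosine(v, vectors[leader_idx]) >= threshold:
                assignment.append(cluster_id)
                break
        else:
            # No sufficiently similar leader was found: start a new cluster.
            leaders.append(i)
            assignment.append(len(leaders) - 1)
    return leaders, assignment


def redundancy_ratio(leaders, n_samples):
    """Fraction of samples that are near-duplicates of a kept leader."""
    return 1 - len(leaders) / n_samples if n_samples else 0.0


# Toy 2-D "embeddings"; real sentence or encoder vectors would be much larger.
toy_embeddings = [
    [1.0, 0.0],
    [0.99, 0.05],  # near-duplicate of the first vector
    [0.0, 1.0],
    [0.02, 1.0],   # near-duplicate of the third vector
    [0.7, 0.7],
]

kept, assignment = leader_cluster(toy_embeddings, threshold=0.95)
ratio = redundancy_ratio(kept, len(toy_embeddings))
print("kept sample indices:", kept)
print("cluster assignment:", assignment)
print("redundancy ratio:", ratio)
```

Running the same procedure once on input embeddings and once on encoder outputs, then comparing the two cluster assignments, is one plausible way to inspect how points map between the two spaces. The threshold controls how aggressively samples are pruned and would need tuning for any real dataset.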