Exploiting Representation Bias for Data Distillation in Abstractive Text Summarization

Abstractive text summarization models are being trained on ever larger numbers of samples to meet the needs of deep learning. These models tend to exploit the training data representations to attain superior performance by improving the quantitative element of the resultant summary. However, increasing the size of the training set is not always the best way to maximize performance, so the quality of the training samples and the learning protocol of deep learning models need to be revisited. In this paper, we aim to discretize the vector space of abstractive text summarization models to understand the characteristics learned between the input embedding space and the models' encoder space. We show that deep models fail to capture the diversity of the input space. Further, the distribution of data points in the encoder space indicates that an unchecked increase in training samples does not add value; rather, the training set needs to be pruned so that the models focus on variability and faithfulness. We employ clustering techniques to learn the diversity of a model's sample space and how data points are mapped from the embedding space to the encoder space and vice versa. Further, we devise a metric to filter out redundant data points, making the model more robust and less data hungry. We benchmark our proposed method using quantitative metrics, such as ROUGE, and qualitative metrics, such as BERTScore, FEQA and Pyramid score. We also quantify the reasons that inhibit the models from learning the diversity of the varied input samples.
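
The abstract does not define the filtering metric, so the following is only a minimal sketch of the general idea of dropping redundant data points in a vector space. It assumes cosine similarity as a stand-in redundancy measure and a hypothetical `threshold` parameter; the function names and the toy vectors are illustrative and are not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_redundant(
    vectors: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Greedy pass: keep a sample only if its similarity to every
    already-kept sample is below `threshold` (an assumed cutoff)."""
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy "encoder-space" vectors (hypothetical values, not from the paper).
encoder_vectors = [
    [1.0, 0.0],
    [0.99, 0.05],  # near-duplicate of sample 0
    [0.0, 1.0],
    [0.7, 0.7],
]
kept_indices = filter_redundant(encoder_vectors, threshold=0.95)
print(kept_indices)
```

In this toy run the second vector is dropped because it nearly coincides with the first, while the other three are kept. A clustering-based variant would first group the vectors and then apply a similar check within each cluster, which avoids comparing every sample against every kept sample.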
