State-of-the-art pretrained language models tend to perform below their capabilities when applied out-of-the-box to tasks that require understanding and working with numbers. Recent work suggests two main reasons for this: (1) popular tokenisation algorithms have limited expressiveness for numbers, and (2) common pretraining objectives do not target numeracy. Approaches that address these shortcomings usually require architectural changes or pretraining from scratch. In this paper, we propose Arithmetic-Based Pretraining, an approach that addresses both shortcomings jointly in a single extended pretraining step, without requiring architectural changes or pretraining from scratch. Arithmetic-Based Pretraining combines contrastive learning, to improve the number representation, with a novel extended pretraining objective called the Inferable Number Prediction Task, to improve numeracy. Our experiments show the effectiveness of Arithmetic-Based Pretraining on three different tasks that require improved numeracy: reading comprehension on the DROP dataset, inference-on-tables on the InfoTabs dataset, and table-to-text generation on the WikiBio and SciGen datasets.
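To make reason (1) concrete, the following is a minimal illustrative sketch, not the tokeniser or the method used in this work. It assumes a hypothetical toy vocabulary (`TOY_VOCAB`) and a simple greedy longest-match segmentation (`greedy_subword_tokenize`, also a hypothetical helper) as a stand-in for learned subword tokenisers. It shows how two adjacent numbers can receive very different segmentations, whereas a digit-level split stays consistent.

```python
# Hypothetical toy vocabulary: real subword vocabularies are learned from data
# and are far larger. The multi-digit entries here are chosen for illustration.
TOY_VOCAB = set("0123456789") | {"12", "20", "100", "345"}


def greedy_subword_tokenize(text, vocab):
    """Segment `text` by repeatedly taking the longest prefix found in `vocab`."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def digit_tokenize(text):
    """Split a number string into its individual characters."""
    return list(text)


if __name__ == "__main__":
    for number in ("1345", "1346"):
        print(number, greedy_subword_tokenize(number, TOY_VOCAB), digit_tokenize(number))
```

Under these assumptions, `1345` is segmented as `['1', '345']` while `1346` becomes `['1', '3', '4', '6']`, so numbers that differ by one share little surface structure at the token level. The digit-level split keeps the first three tokens of both numbers identical.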