Pre-trained language models have recently emerged as a powerful tool that can be fine-tuned on a variety of language tasks. Ideally, when models are pre-trained on large amounts of data, they are expected to gain implicit knowledge. In this paper, we investigate the ability of pre-trained language models to generalize to non-language tasks. In particular, we test them on tasks from different domains: computer vision, reasoning on hierarchical data, and protein fold prediction. The four pre-trained models that we use (T5, BART, BERT, and GPT-2) achieve outstanding results. They all perform similarly, and they outperform transformers trained from scratch by a large margin. For instance, on the Listops dataset, pre-trained language models reach an average accuracy of 58.7\%, compared to 29.0\% for transformers trained from scratch. The significant improvement across all three types of datasets suggests that pre-training on language helps the models acquire general knowledge, bringing us a step closer to general AI. We also show that reducing the number of parameters in pre-trained language models has little impact: performance drops only slightly when using T5-Small instead of T5-Base. In fact, even when using only 2\% of the parameters, we achieve a large improvement over training from scratch. Finally, in contrast to prior work, we find that using pre-trained embeddings for the input layer is necessary to achieve the desired results.