This evidence-based position paper critiques current research practices within the language model pre-training literature. Despite rapid recent progress afforded by steadily improving pre-trained language models (PLMs), current PLM research practices often conflate different possible sources of model improvement, without conducting proper ablation studies and principled comparisons between different models under comparable conditions. These practices (i) leave us ill-equipped to understand which pre-training approaches should be used under what circumstances; (ii) impede reproducibility and credit assignment; and (iii) make it difficult to answer the question: "How exactly does each factor contribute to the progress that we have today?" We provide a case in point by revisiting the success of BERT over its baselines, ELMo and GPT-1, and demonstrate how -- under comparable conditions where the baselines are tuned to a similar extent -- these baselines (and even simpler variants thereof) can, in fact, achieve performance competitive with or better than BERT. These findings demonstrate how disentangling the different factors behind model improvements can lead to valuable new insights. We conclude with recommendations for how to encourage and incentivize this line of work, and how to accelerate progress towards a better and more systematic understanding of what factors drive the progress of our foundation models today.