Large Language Models (LLMs) have revolutionised the field of Natural Language Processing (NLP) and have achieved state-of-the-art performance in practically every task in this field. However, the prevalent approach to text generation, Causal Language Modelling (CLM), generates text sequentially from left to right, which inherently limits the freedom of the model: it cannot decide when and where each token is generated. In contrast, Masked Language Modelling (MLM), primarily used for language understanding tasks, can generate tokens anywhere in the text and in any order. This paper conducts an extensive comparison of the MLM and CLM approaches for text generation tasks. To do so, we pre-train several language models of comparable sizes on three different datasets, namely 1) medical discharge summaries, 2) movie plot synopses, and 3) authorship verification datasets. To assess the quality of the generations, we first employ quantitative metrics and then perform a qualitative human evaluation to analyse coherence and grammatical correctness. In addition, we evaluate the usefulness of the generated texts by using them in three different downstream tasks: 1) Entity Recognition, 2) Text Classification, and 3) Authorship Verification. The results show that MLM consistently outperforms CLM in text generation across all datasets, with higher quantitative scores and better coherence in the generated text. The study also finds no strong correlation between the quality of the generated text and the performance of the models in the downstream tasks. Overall, this study shows that MLM has great potential for text generation and provides direction for future research in this area.