Language modeling studies probability distributions over strings of text. It is one of the most fundamental tasks in natural language processing (NLP) and has been widely used in text generation, speech recognition, machine translation, and other applications. Conventional language models (CLMs) aim to predict the probability of linguistic sequences in a causal manner. In contrast, pre-trained language models (PLMs) cover broader concepts and can be used both for causal sequential modeling and for fine-tuning on downstream applications. PLMs have their own training paradigms (usually self-supervised) and serve as foundation models in modern NLP systems. This overview paper introduces both CLMs and PLMs from five aspects: linguistic units, structures, training methods, evaluation methods, and applications. Furthermore, we discuss the relationship between CLMs and PLMs and shed light on future directions of language modeling in the pre-trained era.