Pre-trained large language models (PLMs) underlie most new developments in natural language processing. They have shifted the field from application-specific model pipelines to a single model that is adapted to a wide range of tasks. Autoregressive PLMs like GPT-3 or PaLM, alongside techniques like few-shot learning, have additionally shifted the output modality to generation instead of classification or regression. Despite their ubiquitous use, the generation quality of language models is rarely evaluated when these models are introduced. Additionally, it is unclear how existing generation tasks, while they can be used to compare systems at a high level, relate to the real-world use cases for which people have been adopting these models. In this work, we discuss how to adapt existing application-specific generation benchmarks to PLMs and provide an in-depth, empirical study of the limitations and capabilities of PLMs in natural language generation tasks along dimensions such as scale, architecture, and input and output language. Our results show that PLMs differ in their applicability to different data regimes and in their generalization to multiple languages, and they inform which PLMs to use for a given generation task setup. We share best practices to be taken into consideration when benchmarking generation capabilities during the development of upcoming PLMs.