In this paper, we introduce a sociolinguistic perspective on language modeling. We claim that large language models are inherently models of varieties of language, and we consider how this insight can inform the development and deployment of large language models. We begin by presenting a technical definition of the concept of a variety of language as developed in sociolinguistics. We then discuss how this perspective can help address five basic challenges in language modeling: social bias, domain adaptation, alignment, language change, and scale. Ultimately, we argue that it is crucial to carefully define and compile training corpora that accurately represent the specific varieties of language being modeled to maximize the performance and societal value of large language models.