Do Large Language Models Adapt to Language Variation across Socioeconomic Status?

Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.


Related content

Language variety


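Language variety is an important topic in sociolinguistic research. R. A. Hudson (Richard Hudson) defines a language variety as "a set of linguistic items with similar social distribution", that is, a form of language commonly used by people who share social characteristics in the same social setting. "Language variety" is a broad concept: it ranges from the dialects of a language down to a single phonological, lexical, or syntactic feature within one dialect. Anything with a definable social distribution counts as a language variety. Varieties are shaped by complex social factors. Sociolinguistic research generally holds that a speaker's social class and speaking style are the main bases of language variation, and that the speaker's gender also has an important influence. Varieties distinguished by user are called dialects; varieties distinguished by use are called styles or registers.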
