Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across communities of different socioeconomic status (SES). We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper-SES communities more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.