Evolving linguistic divergence on polarizing social media

Language change is influenced by many factors, but often starts from synchronic variation, where multiple linguistic patterns or forms coexist, or where different speech communities use language in increasingly different ways. Besides regional or economic reasons, communities may form and segregate based on political alignment. Such segregation, referred to as political polarization, is of growing societal concern across the world. Here we map and quantify linguistic divergence across the partisan left-right divide in the United States, using social media data. We develop a general methodology to delineate (social) media users by their political preference, based on which (potentially biased) news media accounts they do and do not follow on a given platform. Our data consists of 1.5M short posts by 10k users (about 20M words) from the social media platform Twitter (now "X"). Delineating this sample involved mining the platform for the lists of followers (n=422M) of 72 large news media accounts. We quantify divergence in topics of conversation and word frequencies, messaging sentiment, and lexical semantics of words and emoji. We find signs of linguistic divergence across all these aspects, especially in topics and themes of conversation, in line with previous research. While US American English remains largely intelligible within its large speech community, our findings point to areas where miscommunication may eventually arise, given ongoing polarization and the linguistic divergence that may follow from it. Our methodology, which combines data mining, lexicostatistics, machine learning, large language models, and a systematic human annotation approach, is largely language- and platform-agnostic. In other words, while we focus here on US political divides and US English, the same approach is applicable to other countries, languages, and social media platforms.
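The delineation step described above (grouping users by which news media accounts they do and do not follow) can be illustrated with a minimal sketch. It is not the procedure used in this study: the function name `delineate_users`, the outlet names, the binary "left"/"right" labels, and the rule of keeping only users who follow outlets of a single lean are all assumptions made for illustration, and the data are invented.

```python
from collections import defaultdict


def delineate_users(followers_by_outlet, outlet_lean):
    """Label each user 'left' or 'right' if every outlet they follow shares that lean.

    Hypothetical helper: users who follow outlets of both leans are left unlabeled.
    """
    leans_followed = defaultdict(set)
    for outlet, followers in followers_by_outlet.items():
        lean = outlet_lean[outlet]
        for user in followers:
            leans_followed[user].add(lean)
    # Keep only users whose followed outlets all share a single lean.
    return {
        user: next(iter(leans))
        for user, leans in leans_followed.items()
        if len(leans) == 1
    }


# Toy data, invented for illustration (not from the study).
outlet_lean = {"outlet_a": "left", "outlet_b": "left", "outlet_c": "right"}
followers_by_outlet = {
    "outlet_a": {"u1", "u2"},
    "outlet_b": {"u1", "u3"},
    "outlet_c": {"u3", "u4"},
}

labels = delineate_users(followers_by_outlet, outlet_lean)
```

In this toy example, `u3` follows outlets of both leans and is therefore excluded, which mirrors the idea that what a user does not follow matters as much as what they do follow.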
