Variation in language use, shaped by speakers' sociocultural background and specific context of use, offers a rich lens on cultural perspectives, values, and opinions. For example, Chinese students discuss "healthy eating" with words like "timing," "regularity," and "digestion," whereas Americans use vocabulary like "balancing food groups" and "avoiding fat and sugar," reflecting distinct cultural models of nutrition. In NLP, the computational study of these Sociocultural Linguistic Phenomena (SLPs) has traditionally relied on tailored analyses of specific groups or topics. Such analyses require specialized data collection and experimental operationalization, a process that is not well suited to quick hypothesis exploration and prototyping. To address this, we propose constructing a "sandbox" designed for systematic and flexible sociolinguistic research. Using our method, we construct a demographically/topically split Reddit dataset, Splits!, validated by self-identification and by replicating several known SLPs from existing literature. We showcase the sandbox's utility with a scalable, two-stage process that filters large collections of "potential" SLPs (PSLPs) to surface the most promising candidates for deeper, qualitative investigation.