The obligatory use of third-person honorifics is a distinctive feature of several South Asian languages, encoding nuanced socio-pragmatic cues such as power, age, gender, fame, and social distance. In this work, (i) we present the first large-scale study of third-person honorific pronoun and verb usage across 10,000 Hindi and Bengali Wikipedia articles, with annotations linked to key socio-demographic attributes of the subjects, including gender, age group, fame, and cultural origin. (ii) Our analysis uncovers systematic intra-language regularities but notable cross-linguistic differences: honorifics are more prevalent in Bengali than in Hindi, while non-honorific forms dominate in references to infamous, juvenile, and culturally exotic entities. Notably, in both languages, and more prominently in Hindi, men are more frequently referred to with honorifics than women. (iii) To examine whether large language models (LLMs) internalize similar socio-pragmatic norms, we probe six LLMs using controlled generation and translation tasks over 1,000 culturally balanced entities. We find that LLMs diverge from Wikipedia usage, exhibiting different preferences in honorific selection across tasks, languages, and socio-demographic attributes. These discrepancies highlight gaps in the socio-cultural alignment of LLMs and open new directions for studying how LLMs acquire, adapt, or distort sociolinguistic norms. Our code and data are publicly available at https://github.com/souro/honorific-wiki-llm.