Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence (neology) analyzed a corpus consisting primarily of historical published texts and identified two factors correlated with the creation of new words (Ryskina et al., 2020, arXiv:2001.07740). We extend this methodology to contextual embeddings in addition to static ones and apply it to a new corpus of Twitter posts. We show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different neologism formation mechanisms.