We study the ability of transformer-based language models (LMs) to understand social media language. Social media (SM) language is distinct from standard written language, yet existing benchmarks fall short of capturing LM performance in this socially, economically, and politically important domain. We quantify the degree to which social media language differs from conventional language and conclude that the difference is significant both in terms of token distribution and rate of linguistic shift. Next, we introduce a new benchmark for Social MedIa Language Evaluation (SMILE) that covers four SM platforms and eleven tasks. Finally, we show that learning a tokenizer and pretraining on a mix of social media and conventional language yields an LM that outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score.
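As an illustration of what quantifying a difference in token distribution can look like, the following minimal sketch compares the unigram distributions of two corpora with the Jensen-Shannon divergence. The abstract does not say which measure or tokenizer is used, so the choice of Jensen-Shannon divergence, the whitespace tokenization, the function names, and the toy corpora are all assumptions made for illustration only.

```python
import math
from collections import Counter


def token_distribution(texts):
    """Relative frequency of each lowercased, whitespace-split token.

    Whitespace splitting is a stand-in (assumption); a learned subword
    tokenizer would be the more realistic choice.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two token distributions.

    Returns 0.0 for identical distributions and 1.0 when the two
    distributions share no tokens.
    """
    vocab = set(p) | set(q)
    # Mixture distribution M = (P + Q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        # KL(a || m); tokens with zero probability in a contribute nothing.
        return sum(pa * math.log2(pa / m[t]) for t, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical toy corpora, not data from the paper.
social_media = ["omg this is lit lol", "lol cant even rn"]
conventional = ["this is a formal sentence", "the committee met today"]

divergence = js_divergence(
    token_distribution(social_media),
    token_distribution(conventional),
)
print(f"JS divergence (bits): {divergence:.3f}")
```

With base-2 logarithms the score is bounded between 0 and 1, which makes values comparable across corpus pairs of different sizes. Tracking the same score between time slices of one corpus would be one way to approximate a rate of linguistic shift, again as an assumption rather than the method the abstract describes.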