Pre-trained language models (PLMs) are fundamental to natural language processing applications. Most existing PLMs are not tailored to the noisy user-generated text found on social media, and their pre-training does not take advantage of the valuable social engagement logs available in a social network. We present TwHIN-BERT, a multilingual language model productionized at Twitter and trained on in-domain data from the popular social network. TwHIN-BERT differs from prior PLMs in that it is trained not only with text-based self-supervision but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation for modeling short, noisy, user-generated text. We evaluate our model on various multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvements over established pre-trained language models. We open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.
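The abstract describes training with both a text-based self-supervised objective and a social objective, but it does not state the exact form of either loss or how they are combined. The following is a minimal sketch under stated assumptions: the social objective is modeled here as an in-batch contrastive (InfoNCE-style) loss over pairs of tweet embeddings that are assumed to share engagement, and it is added to a masked-language-modeling loss with a fixed weight. The function names `info_nce` and `combined_loss`, the `temperature` value, and the `social_weight` value are all hypothetical and are not taken from the source.

```python
import math


def info_nce(anchors, positives, temperature=0.1):
    """In-batch contrastive loss (assumed form of the social objective).

    Row i of `anchors` is treated as socially related to row i of
    `positives`; every other row in `positives` acts as a negative.
    """

    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(a_i, p_j)) / temperature for p_j in p]
        # Numerically stable log-sum-exp over the candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching row as the correct class.
        total += log_z - logits[i]
    return total / len(a)


def combined_loss(mlm_loss, social_loss, social_weight=0.1):
    """Weighted sum of the text loss and the social loss (hypothetical weighting)."""
    return mlm_loss + social_weight * social_loss


if __name__ == "__main__":
    # Toy embeddings standing in for pooled tweet representations.
    tweets = [[1.0, 0.0], [0.0, 1.0]]
    co_engaged = [[0.9, 0.1], [0.2, 0.8]]
    social = info_nce(tweets, co_engaged)
    print(combined_loss(mlm_loss=2.3, social_loss=social))
```

In this sketch the contrastive term is small when each tweet embedding is closest to its co-engaged partner and large when it is closer to an unrelated tweet in the batch, which is one plausible way for engagement signals to shape the representation alongside the text objective.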