Towards Better Inclusivity: A Diverse Tweet Corpus of English Varieties

The prevalence of social media presents a growing opportunity to collect and analyse examples of English varieties. Whilst these varieties were, and in many cases still are, used only in spoken contexts or hard-to-access private messages, social media sites like Twitter provide a platform for users to communicate informally in a scrapeable format. Notably, Indian English (Hinglish), Singaporean English (Singlish), and African-American English (AAE) are commonly found online. These varieties pose a challenge to existing natural language processing (NLP) tools because they often differ orthographically and syntactically from the standard English for which most of these tools are built. NLP models trained on standard English texts produce biased outcomes for users of underrepresented varieties. Some research has aimed to overcome the inherent biases caused by unrepresentative data through techniques like data augmentation or adjusting training models. We aim to address the issue of bias at its root: the data itself. We curate a dataset of tweets from countries with high proportions of underserved English variety speakers, and propose an annotation framework of six categorical classifications along a pseudo-spectrum that measures the degree of standard English and thereby indirectly surfaces the manifestations of English varieties in these tweets. Following best annotation practices, our growing corpus features 170,800 tweets taken from 7 countries, labeled by annotators who are from those countries and can communicate in regionally dominant varieties of English. Our corpus highlights the accuracy discrepancies in pre-trained language identifiers between Western English and non-Western (i.e., less standard) English varieties. We hope to contribute to the growing literature on identifying and reducing the implicit demographic discrepancies in NLP.
