Language models have become foundational across NLP applications, but they have not been well applied in language variety studies, even for the most widely used language, English. This paper represents one of the few initial efforts to apply NLP technology within the paradigm of World Englishes, specifically by creating a multi-variety corpus for studying Asian Englishes. We present an overview of the Corpus of Chinese-based Asian English (CCAE), a suite of corpora comprising six Chinese-based Asian English varieties. It contains 340 million tokens in 448 thousand web documents from six regions. The ontology of the data makes the corpus a helpful resource with considerable research potential for Asian Englishes (especially for Chinese Englishes, for which no publicly accessible corpus has existed so far) and an ideal source for variety-specific language modeling and downstream tasks, thus setting the stage for NLP-based World Englishes studies. Preliminary experiments on this corpus demonstrate the practical value of CCAE. Finally, we make CCAE available at \href{https://huggingface.co/datasets/CCAE/CCAE-Corpus}{this https URL}.