This paper presents the methodology and data used for the automatic extraction of the Romanian Academic Word List (Ro-AWL). Academic word lists are useful in both L2 and L1 teaching contexts, yet no such resource has existed for Romanian so far. Ro-AWL was generated by combining methods from corpus and computational linguistics with L2 academic writing approaches. We use two types of data: (a) existing data, such as the Romanian Frequency List based on the ROMBAC corpus, and (b) self-compiled data, such as the expert academic writing corpus EXPRES. To construct the list, we follow the methodology used to build the Academic Vocabulary List for English. The distribution of Ro-AWL features (general distribution, POS distribution) across four disciplinary datasets is in line with previous research. Ro-AWL is freely available and can be used for teaching, research, and NLP applications.
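As a rough illustration of the kind of corpus comparison such a methodology relies on, the sketch below keeps lemmas that are markedly more frequent in an academic corpus than in a general one and that also occur in several disciplines. It is a minimal sketch under stated assumptions, not the Ro-AWL pipeline itself: the function name, the thresholds `min_ratio` and `min_range`, the discipline labels, and the toy counts are all hypothetical, and the actual procedure may apply further criteria (for example, dispersion measures) that are not shown here.

```python
from collections import Counter


def per_million(count, total):
    """Normalise a raw count to occurrences per million tokens."""
    return 1_000_000 * count / total if total else 0.0


def extract_academic_words(academic_by_discipline, general_counts,
                           min_ratio=1.5, min_range=3):
    """Select candidate academic lemmas (illustrative criteria only).

    academic_by_discipline: dict mapping a discipline name to a Counter
        of lemma counts in that part of the academic corpus.
    general_counts: Counter of lemma counts in a general reference corpus.
    min_ratio, min_range: hypothetical thresholds, not values from the paper.
    """
    # Pool the discipline sub-corpora into one academic frequency list.
    academic_total = Counter()
    for counts in academic_by_discipline.values():
        academic_total.update(counts)

    n_academic = sum(academic_total.values())
    n_general = sum(general_counts.values())

    selected = []
    for lemma, count in academic_total.items():
        academic_pm = per_million(count, n_academic)
        general_pm = per_million(general_counts.get(lemma, 0), n_general)

        # Ratio criterion: drop lemmas that are not clearly more frequent
        # in academic text. Lemmas absent from the general corpus pass.
        if general_pm > 0 and academic_pm / general_pm < min_ratio:
            continue

        # Range criterion: the lemma must occur in enough disciplines.
        spread = sum(1 for counts in academic_by_discipline.values()
                     if counts[lemma] > 0)
        if spread < min_range:
            continue

        selected.append(lemma)

    return sorted(selected)


# Toy data with four hypothetical disciplinary datasets.
disciplines = {
    "humanities": Counter({"rezultat": 5, "de": 50, "roman": 8}),
    "science": Counter({"rezultat": 6, "de": 40, "proton": 9}),
    "economics": Counter({"rezultat": 4, "de": 45}),
    "technology": Counter({"rezultat": 5, "de": 55}),
}
general = Counter({"de": 900, "rezultat": 2, "roman": 3,
                   "proton": 1, "copil": 94})

print(extract_academic_words(disciplines, general))
```

In this toy run, the function word "de" fails the ratio criterion because it is just as common in general text, while "roman" and "proton" fail the range criterion because each appears in only one discipline.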