Conducting experiments with diverse participants in their native languages can uncover insights into culture, cognition, and language that may not be revealed otherwise. However, conducting these experiments online makes it difficult to validate self-reported language proficiency. Furthermore, existing proficiency tests are small and cover only a few languages. We present an automated pipeline to generate vocabulary tests using text from Wikipedia. Our pipeline samples rare nouns and creates pseudowords with the same low-level statistics. Six behavioral experiments (N=236) in six countries and eight languages show that (a) our test can distinguish between native speakers of closely related languages, (b) the test is reliable ($r=0.82$), and (c) performance strongly correlates with existing tests (LexTale) and self-reports. We further show that test accuracy is negatively correlated with the linguistic distance between the tested and the native language. Our test, available in eight languages, can easily be extended to other languages.
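As a rough illustration of the pseudoword step, the sketch below trains a character n-gram model on a list of real words and samples new strings that share those local letter statistics while excluding any string that is itself a real word. It is not the pipeline's actual implementation: the abstract does not specify which low-level statistics are matched, so the character n-gram approach, the function names (`train_ngram_model`, `generate_pseudoword`, `sample_pseudowords`), and all parameter defaults are assumptions made for illustration only.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # padding symbols, assumed absent from the real words


def train_ngram_model(words, order=2):
    """Map each `order`-character context to the list of characters that follow it."""
    model = defaultdict(list)
    for word in words:
        padded = START * order + word + END
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate_pseudoword(model, rng, order=2, max_len=12):
    """Sample one string character by character from the n-gram model."""
    context = START * order
    chars = []
    while len(chars) < max_len:
        nxt = rng.choice(model[context])  # frequent continuations are drawn more often
        if nxt == END:
            break
        chars.append(nxt)
        context = context[1:] + nxt
    return "".join(chars)


def sample_pseudowords(real_words, n, seed=0, order=2, min_len=4, max_attempts=10000):
    """Return up to `n` distinct generated strings that are not real words."""
    rng = random.Random(seed)
    vocab = set(real_words)
    model = train_ngram_model(sorted(vocab), order)  # sorted for reproducibility
    found = []
    for _ in range(max_attempts):
        if len(found) == n:
            break
        candidate = generate_pseudoword(model, rng, order)
        if len(candidate) >= min_len and candidate not in vocab and candidate not in found:
            found.append(candidate)
    return found
```

Calling `sample_pseudowords(nouns, 20)` on a list of rare nouns would yield up to 20 word-like distractors. In a real test the list would come from a large corpus such as Wikipedia, and a larger reference vocabulary would be needed to make sure no generated string is an actual word of the language.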