Using Letter Positional Probabilities to Assess Word Complexity

Word complexity has been defined in several ways. Psycholinguistic, morphological and lexical proxies are often used, as are human ratings. However, these proxies do not measure complexity directly, and human ratings are susceptible to subjective bias. In this study we contend that a form of 'latent complexity' can be approximated from samples of simple and complex words. We use a sample of 'simple' words from primary school picture books and a sample of 'complex' words from high school and academic settings. To analyse the differences between these classes, we examine letter positional probabilities (LPPs). We find strong statistical associations between several LPPs and complexity. For example, simple words are significantly (p<.001) more likely to start with w, b, s, h, g, k, j, t, y or f, while complex words are significantly (p<.001) more likely to start with i, a, e, r, v, u or d. We find similarly strong associations for subsequent letter positions, with 84 letter-position variables in the first 6 positions significant at the p<.001 level. We then use LPPs as variables in a classifier that separates the two classes with 83% accuracy. We test these findings on a second dataset and find 66 LPPs in the first 6 positions that are significant (p<.001) in both datasets. We use these 66 variables to build a classifier that classifies a third dataset with 70% accuracy. Finally, we create a fourth sample by combining the extreme high- and low-scoring words generated by three classifiers, each built on one of the first three datasets, and use this sample to build a classifier with an accuracy of 97%. We use this classifier to score the four levels of English word groups from an ESL program.
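
As a rough illustration of what a letter positional probability is, the following Python sketch estimates LPPs for the first 6 positions of two word samples and compares them. It is a minimal sketch under stated assumptions, not the study's implementation: the function names `letter_position_probs` and `lpp_difference` are hypothetical, the word lists are invented toy data, and the choice to compute each probability over only those words long enough to have that position is an assumption.

```python
from collections import Counter


def letter_position_probs(words, max_pos=6):
    """Return {(position, letter): probability} for the first max_pos positions.

    Assumption: the probability for position p is taken over the words
    that are at least p letters long (positions are 1-indexed).
    """
    counts = Counter()  # how often each letter occurs at each position
    totals = Counter()  # how many words have a letter at each position
    for word in words:
        for pos, letter in enumerate(word.lower()[:max_pos], start=1):
            counts[(pos, letter)] += 1
            totals[pos] += 1
    return {key: n / totals[key[0]] for key, n in counts.items()}


# Toy word lists, invented for illustration only (not the study's data).
simple_words = ["ball", "wolf", "sun", "hat", "big"]
complex_words = ["analysis", "evaluate", "derive", "inference", "utility"]

simple_lpp = letter_position_probs(simple_words)
complex_lpp = letter_position_probs(complex_words)


def lpp_difference(pos, letter):
    """Positive when the letter at this position is more typical of the simple sample."""
    return simple_lpp.get((pos, letter), 0.0) - complex_lpp.get((pos, letter), 0.0)


print(lpp_difference(1, "b"))  # first-letter 'b': simple minus complex
print(lpp_difference(1, "i"))  # first-letter 'i': simple minus complex
```

In a fuller analysis, each (position, letter) pair would become one variable, tested for a significant difference between the two samples and then passed to a classifier; the specific test and classifier are not shown here.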
