Unknown words in a text significantly hinder reading comprehension. To improve accessibility for specific target populations, computational modelling has been applied to identify complex words in texts and replace them with simpler alternatives. In this paper, we present an overview of computational approaches to lexical complexity prediction, focusing on work carried out on English data. We survey relevant approaches to this problem, which include traditional machine learning classifiers (e.g. SVMs, logistic regression) and deep neural networks, together with a variety of features, such as word frequency, word length, and features inspired by the psycholinguistics literature, among many others. Furthermore, we introduce readers to past competitions and available datasets created for this task. Finally, we include brief sections on applications of lexical complexity prediction, such as readability and text simplification, together with related studies on languages other than English.