The impact of text length on the estimation of lexical diversity has captured the attention of the scientific community for more than a century. Numerous indices have been proposed, and many studies have been conducted to evaluate them, but the problem remains unresolved. This methodological review provides a critical analysis not only of the indices most commonly used in language learning studies, but also of the length problem itself and of the methodology for evaluating the proposed solutions. An analysis of three datasets of texts written by English language learners revealed that indices that reduce all texts to the same length, using either a probabilistic or an algorithmic approach, solve the length dependency problem. However, all of these indices fail to address a second problem: their sensitivity to the parameter that determines the length to which the texts are reduced. The paper concludes with recommendations for optimizing lexical diversity analysis.
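The two ideas in the abstract, length dependency and sensitivity to the reduction parameter, can be illustrated with a minimal Python sketch. The abstract does not name the indices it evaluates, so the three functions below are assumptions chosen for illustration: a plain type-token ratio, a moving-window average as one possible algorithmic reduction, and a hypergeometric expectation as one possible probabilistic reduction. The function names and the parameters `window` and `sample_size` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import comb


def ttr(tokens):
    """Plain type-token ratio: distinct tokens divided by total tokens."""
    if not tokens:
        raise ValueError("tokens must not be empty")
    return len(set(tokens)) / len(tokens)


def moving_window_ttr(tokens, window):
    """Algorithmic reduction (illustrative): mean TTR over every
    contiguous window of `window` tokens."""
    if window < 1 or window > len(tokens):
        raise ValueError("window must be between 1 and len(tokens)")
    scores = [ttr(tokens[i:i + window])
              for i in range(len(tokens) - window + 1)]
    return sum(scores) / len(scores)


def expected_ttr(tokens, sample_size):
    """Probabilistic reduction (illustrative): expected TTR of a random
    sample of `sample_size` tokens drawn without replacement."""
    n = len(tokens)
    if sample_size < 1 or sample_size > n:
        raise ValueError("sample_size must be between 1 and len(tokens)")
    total = comb(n, sample_size)
    # A type with frequency f is absent from the sample with probability
    # C(n - f, sample_size) / C(n, sample_size).
    expected_types = sum(1 - comb(n - f, sample_size) / total
                         for f in Counter(tokens).values())
    return expected_types / sample_size


if __name__ == "__main__":
    short = "a b a c a b d a".split()
    longer = short * 2  # same vocabulary, twice the length

    # Length dependency: the plain ratio changes with text length alone.
    print(ttr(short), ttr(longer))

    # Parameter sensitivity: the reduced-length scores depend on the
    # chosen window or sample size.
    for size in (2, 4, 8):
        print(size, moving_window_ttr(longer, size), expected_ttr(longer, size))
```

Both reduced-length functions evaluate every text at a common length, which is what removes the dependence on the original text length; the price is that the result is only comparable across studies that use the same `window` or `sample_size`.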