We propose a resampling-based approach for assessing keyness in corpus linguistics, following suggestions by Gries (2006, 2022). Traditional approaches based on hypothesis tests (e.g. the log-likelihood ratio, LLR) model the corpora as independent, identically distributed samples of tokens. This model does not account for the frequently observed uneven distribution of a word's occurrences across a corpus. When the occurrences of a word are concentrated in a few documents, large values of LLR and similar scores are in fact much more likely than the token-by-token sampling model predicts, which leads to false positives. We replace the token-by-token sampling model with one in which corpora are samples of documents rather than tokens, which is much closer to the way corpora are actually assembled. We then use a permutation approach to approximate the distribution of a given keyness score under the null hypothesis of equal frequencies and obtain p-values for assessing significance. We need no assumptions about how the tokens are organized within or across documents, and the approach works with basically *any* keyness score. Hence, apart from obtaining more accurate p-values for scores like LLR, we can also assess significance for, e.g., the log ratio, which has been proposed as a measure of effect size. An efficient implementation of the proposed approach is provided in the `R` package `keyperm`, available from GitHub.
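To make the document-level permutation idea concrete, the following is a minimal sketch in Python. It is not the `keyperm` implementation and does not reflect its API. The function names (`llr`, `permutation_p_value`), the representation of a document as a `(word_count, document_length)` pair, the use of the four-cell form of the log-likelihood ratio, and the toy data are all assumptions made for illustration.

```python
import math
import random


def llr(k_a, n_a, k_b, n_b):
    """Four-cell log-likelihood ratio (G^2) for a word that occurs
    k_a times among n_a tokens in corpus A and k_b times among n_b in B."""
    def term(observed, expected):
        # Cells with zero observations contribute nothing to the sum.
        return observed * math.log(observed / expected) if observed > 0 else 0.0

    p = (k_a + k_b) / (n_a + n_b)  # pooled relative frequency under the null
    return 2.0 * (term(k_a, n_a * p) + term(k_b, n_b * p)
                  + term(n_a - k_a, n_a * (1 - p))
                  + term(n_b - k_b, n_b * (1 - p)))


def permutation_p_value(docs_a, docs_b, score=llr, n_perm=2000, seed=0):
    """Approximate a p-value by shuffling whole documents between corpora.

    Each document is a (word_count, document_length) pair (an assumed,
    illustrative representation). Returns (observed_score, p_value).
    """
    rng = random.Random(seed)
    pooled = list(docs_a) + list(docs_b)
    size_a = len(docs_a)

    def evaluate(docs):
        first, second = docs[:size_a], docs[size_a:]
        return score(sum(k for k, _ in first), sum(n for _, n in first),
                     sum(k for k, _ in second), sum(n for _, n in second))

    observed = evaluate(pooled)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # reassign documents, not tokens, to the corpora
        # Small tolerance guards against floating-point rounding on ties.
        if evaluate(pooled) >= observed - 1e-9:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 10 documents of 1000 tokens per corpus, 40 occurrences in A, 0 in B.
empty = [(0, 1000)] * 10
bursty_a = [(40, 1000)] + [(0, 1000)] * 9   # all occurrences in one document
even_a = [(4, 1000)] * 10                   # occurrences spread evenly

obs_bursty, p_bursty = permutation_p_value(bursty_a, empty)
obs_even, p_even = permutation_p_value(even_a, empty)
print(f"bursty: LLR={obs_bursty:.2f}, permutation p={p_bursty:.4f}")
print(f"even:   LLR={obs_even:.2f}, permutation p={p_even:.4f}")
```

Both toy corpora have identical total counts and therefore the same LLR, which lies far above the usual chi-square cutoff of 3.84 for one degree of freedom. In the bursty case, every relabelling places the single document containing the word in one corpus or the other, so the permuted scores are as large as the observed one and the permutation p-value is large. In the evenly spread case, the observed score is rarely matched and the p-value is small. Because `score` is just a function of the four totals, another keyness score could be passed in the same way.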