TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test that captures word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents with a family of beta-binomial distributions, with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed across the documents of the collection, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
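To make the comparison between the two hypotheses concrete, the following is a minimal sketch of a penalized likelihood-ratio statistic for a single term. It is not the paper's derivation. The function names, the mean/precision parameterization (`alpha = p * s`, `beta = (1 - p) * s`), the gamma hyperparameters `shape` and `rate`, and the grid search over the precision `s` are all assumptions chosen for illustration, and the sketch does not reproduce the TF-IDF-like closed form that the abstract refers to.

```python
import math


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def burstiness_statistic(counts, lengths, shape=2.0, rate=0.01):
    """Illustrative penalized likelihood-ratio statistic for one term.

    counts[i]  : occurrences of the term in document i
    lengths[i] : total number of tokens in document i
    shape, rate: hypothetical gamma hyperparameters for the penalty on
                 the precision s (not taken from the paper)
    """
    total_k, total_n = sum(counts), sum(lengths)
    if not 0 < total_k < total_n:
        raise ValueError("the term must occur, but not fill every token slot")
    p = total_k / total_n  # pooled occurrence rate, shared by both hypotheses

    # Null hypothesis: binomial counts with rate p in every document.
    # Binomial coefficients are omitted because they cancel in the ratio.
    ll_null = sum(
        k * math.log(p) + (n - k) * math.log(1.0 - p)
        for k, n in zip(counts, lengths)
    )

    # Alternative hypothesis: beta-binomial counts with mean p and precision s,
    # plus a gamma log-density penalty on s (up to an additive constant).
    # Assumption: s is maximized over a coarse log-spaced grid.
    best = -math.inf
    for exponent in range(-8, 13):
        s = 10.0 ** (exponent / 4.0)
        alpha, beta = p * s, (1.0 - p) * s
        ll_alt = sum(
            log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
            for k, n in zip(counts, lengths)
        )
        penalty = (shape - 1.0) * math.log(s) - rate * s
        best = max(best, ll_alt + penalty)

    return 2.0 * (best - ll_null)


# Two terms with the same total count over ten 100-token documents.
lengths = [100] * 10
bursty = burstiness_statistic([20, 0, 0, 0, 0, 0, 0, 0, 0, 0], lengths)
even = burstiness_statistic([2] * 10, lengths)
```

Under these assumptions, a term whose occurrences are concentrated in a single document yields a larger statistic than a term with the same total count spread evenly across documents, which is the over-dispersion signal the alternative hypothesis is designed to detect.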