TF-IDF is a classical formula widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test that captures word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents with a family of beta-binomial distributions, with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed across the documents of the collection, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis-testing frameworks for advancing the development of term-weighting schemes.
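To make the hypothesis-testing setup concrete, the following is a minimal sketch of a penalized likelihood-ratio statistic for a single term. It is an illustration under stated assumptions, not the paper's derivation: the function names, the gamma hyperparameters (`shape=2.0`, `rate=0.1`), the choice to fix the beta-binomial mean at the pooled term rate, and the grid search over the precision parameter are all hypothetical choices made only for this example.

```python
import math


def log_choose(n, k):
    """Log of the binomial coefficient C(n, k), computed via log-gamma."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_loglik(counts, lengths, p):
    """Null hypothesis: every document shares one binomial term rate p."""
    return sum(
        log_choose(n, k) + k * math.log(p) + (n - k) * math.log1p(-p)
        for k, n in zip(counts, lengths)
    )


def beta_binomial_loglik(counts, lengths, p, s):
    """Alternative: beta-binomial with mean p and precision s (a + b = s)."""
    a, b = p * s, (1.0 - p) * s
    return sum(
        log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)
        for k, n in zip(counts, lengths)
    )


def gamma_log_penalty(s, shape=2.0, rate=0.1):
    """Log-density of a gamma distribution on the precision s.

    The hyperparameters are hypothetical values chosen for illustration.
    """
    return (
        shape * math.log(rate)
        - math.lgamma(shape)
        + (shape - 1.0) * math.log(s)
        - rate * s
    )


def penalized_lr_statistic(counts, lengths):
    """Twice the gap between the penalized alternative and the null fit.

    Assumes the term occurs at least once and does not fill every position,
    so the pooled rate p lies strictly between 0 and 1.
    """
    p = sum(counts) / sum(lengths)
    null_ll = binomial_loglik(counts, lengths, p)
    # Assumption: a coarse log-spaced grid over s stands in for a real optimizer.
    grid = [10 ** (i / 10) for i in range(-20, 41)]
    alt_ll = max(
        beta_binomial_loglik(counts, lengths, p, s) + gamma_log_penalty(s)
        for s in grid
    )
    return 2.0 * (alt_ll - null_ll)


# Two toy terms with the same total count (12) over ten 100-token documents.
lengths = [100] * 10
bursty = [0, 0, 0, 0, 12, 0, 0, 0, 0, 0]   # concentrated in one document
even = [1, 1, 1, 2, 1, 1, 1, 2, 1, 1]      # spread across all documents

print("bursty:", penalized_lr_statistic(bursty, lengths))
print("even:  ", penalized_lr_statistic(even, lengths))
```

In this toy setting the bursty term receives the larger statistic, which mirrors the intuition behind IDF: a term concentrated in few documents is more informative than one spread evenly across the collection, even when the total counts match.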