Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test that captures word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents with a family of beta-binomial distributions, with a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed across the documents in the collection, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
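To make the test setup concrete, here is a minimal sketch of a penalized log-likelihood comparison between a binomial null and a beta-binomial alternative for a single term. It is not the paper's derivation: the abstract does not give the exact parameterization, so the mean/precision form of the beta-binomial, the choice to fix the mean at the pooled rate, the coarse grid search over the precision, and the gamma `shape` and `rate` defaults are all assumptions made for illustration, as is the function name `burstiness_statistic`.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def binomial_loglik(counts, lengths, p):
    # Binomial coefficients are omitted: they cancel in the ratio.
    return sum(k * math.log(p) + (n - k) * math.log(1 - p)
               for k, n in zip(counts, lengths))

def beta_binomial_loglik(counts, lengths, mean, precision):
    # Assumed parameterization: alpha = mean * precision,
    # beta = (1 - mean) * precision.
    a, b = mean * precision, (1 - mean) * precision
    return sum(log_beta(k + a, n - k + b) - log_beta(a, b)
               for k, n in zip(counts, lengths))

def burstiness_statistic(counts, lengths, shape=1.0, rate=0.01):
    """2 * (penalized alternative log-likelihood - null log-likelihood).

    counts[i] is the term's count in document i, lengths[i] that
    document's length. Requires 0 < sum(counts) < sum(lengths).
    shape and rate are hypothetical gamma-penalty settings.
    """
    p_hat = sum(counts) / sum(lengths)  # pooled rate (binomial MLE)
    null = binomial_loglik(counts, lengths, p_hat)
    # Coarse grid over the precision, from 0.01 to 1e5 (a simplification).
    grid = [10 ** (e / 4) for e in range(-8, 21)]
    best = max(
        beta_binomial_loglik(counts, lengths, p_hat, s)
        + (shape - 1) * math.log(s) - rate * s  # log gamma density, up to a constant
        for s in grid
    )
    return 2 * (best - null)

lengths = [100, 100, 100, 100]
bursty = burstiness_statistic([20, 0, 0, 0], lengths)  # concentrated in one document
even = burstiness_statistic([5, 5, 5, 5], lengths)     # same total, evenly spread
```

Under these assumptions, a term whose occurrences are concentrated in one document scores higher than a term with the same total count spread evenly, which is the behavior a burstiness test is meant to capture.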



TF-IDF


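TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, but decreases as the word becomes more frequent across the corpus. Variants of TF-IDF weighting are often used by search engines to score or rank the relevance of a document to a user query. In addition to TF-IDF, web search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.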
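As a small illustration of the weighting described above, the following sketch computes one common TF-IDF variant: relative term frequency multiplied by the natural log of the inverse document frequency. Many other variants exist, and the function name `tf_idf` and the toy documents are made up for this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            # tf = count / document length, idf = log(N / df)
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
```

In this toy corpus, "the" appears in every document and so scores zero everywhere, while "mat", which appears in only one document, outscores "cat" in the first document.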
