Common TF-IDF variants arise as key components in the test statistic of a penalized likelihood-ratio test for word burstiness

TF-IDF is a classical formula that is widely used for identifying important terms within documents. We show that TF-IDF-like scores arise naturally from the test statistic of a penalized likelihood-ratio test that captures word burstiness (also known as word over-dispersion). In our framework, the alternative hypothesis captures word burstiness by modeling a collection of documents with a family of beta-binomial distributions and a gamma penalty term on the precision parameter. In contrast, the null hypothesis assumes that words are binomially distributed across the documents of the collection, a modeling approach that fails to account for word burstiness. We find that a term-weighting scheme derived from this test statistic performs comparably to TF-IDF on document classification tasks. This paper provides insights into TF-IDF from a statistical perspective and underscores the potential of hypothesis testing frameworks for advancing term-weighting scheme development.
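To make the contrast between the two hypotheses concrete, the following is a minimal sketch rather than the paper's actual procedure. It compares the log-likelihood of one word's per-document counts under a binomial model and under a beta-binomial model. The toy counts, the function names, and the hand-picked precision value are all assumptions made for illustration; the gamma penalty on the precision parameter and any parameter fitting are deliberately left out.

```python
import math


def log_binom_coeff(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def binomial_loglik(counts, lengths, p):
    """Log-likelihood when every document shares one occurrence rate p."""
    return sum(
        log_binom_coeff(n, k) + k * math.log(p) + (n - k) * math.log1p(-p)
        for k, n in zip(counts, lengths)
    )


def beta_binomial_logpmf(k, n, mean, precision):
    """Log pmf of a beta-binomial with alpha = mean * precision
    and beta = (1 - mean) * precision."""
    a = mean * precision
    b = (1.0 - mean) * precision
    return log_binom_coeff(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)


def beta_binomial_loglik(counts, lengths, mean, precision):
    """Log-likelihood when each document draws its own rate from a beta prior."""
    return sum(
        beta_binomial_logpmf(k, n, mean, precision)
        for k, n in zip(counts, lengths)
    )


# Toy data (hypothetical): a bursty word that is absent from most documents
# but occurs many times in the two documents that mention it.
counts = [0, 0, 0, 0, 12, 0, 0, 9, 0, 0]
lengths = [100] * len(counts)

# Pooled occurrence rate, used as the binomial rate and the beta-binomial mean.
rate = sum(counts) / sum(lengths)

# Small precision means strong over-dispersion. Chosen by hand, not fitted.
precision = 1.0

ll_null = binomial_loglik(counts, lengths, rate)
ll_alt = beta_binomial_loglik(counts, lengths, rate, precision)

# Unpenalized log-likelihood ratio in favor of the bursty model.
log_ratio = ll_alt - ll_null
print(f"binomial: {ll_null:.2f}  beta-binomial: {ll_alt:.2f}  log ratio: {log_ratio:.2f}")
```

On counts like these, the beta-binomial model assigns a much higher likelihood than the binomial model, which is the over-dispersion signal a likelihood-ratio statistic can pick up. A faithful implementation would also estimate the parameters and include the penalty term described in the abstract.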


