The hypergeometric test performs comparably to a common TF-IDF variant on standard information retrieval tasks

Term frequency-inverse document frequency, or tf-idf for short, and its many variants form a class of term weighting functions the members of which are widely used in information retrieval applications. While tf-idf was originally proposed as a heuristic, theoretical justifications grounded in information theory, probability, and the divergence from randomness paradigm have been advanced. In this work, we present an empirical study showing that the hypergeometric test of statistical significance corresponds very nearly with a common tf-idf variant on selected real-data document retrieval and summarization tasks. These findings suggest that a fundamental mathematical connection between the tf-idf variant and the negative logarithm of the hypergeometric test P-value (i.e., a hypergeometric distribution tail probability) remains to be elucidated. We offer the empirical case study herein as a first step toward explaining the long-standing effectiveness of tf-idf from a statistical significance testing foundation.
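As a rough illustration of the comparison described above, the following Python sketch scores each term in one document of a toy corpus two ways: with one common tf-idf variant (raw term count times log(D/df)) and with the negative logarithm of a hypergeometric tail probability. The abstract does not say which tf-idf variant was studied or how the test is parameterized, so everything here is an assumption made for illustration: the token-level setup (population = all tokens in the corpus, successes = corpus occurrences of the term, draws = tokens in the document), the function names `neg_log_hypergeom_pvalue` and `score_document`, and the toy corpus.

```python
import math
from collections import Counter


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def neg_log_hypergeom_pvalue(k, n, K, N):
    """Return -log P(X >= k) for X ~ Hypergeometric(N, K, n).

    Assumed token-level setup (not taken from the paper):
      N = tokens in the corpus, K = corpus occurrences of the term,
      n = tokens in the document, k = occurrences of the term in the document.
    """
    lo = max(k, n - (N - K))  # smallest feasible count in the upper tail
    hi = min(n, K)            # largest feasible count
    log_terms = [
        log_comb(K, i) + log_comb(N - K, n - i) - log_comb(N, n)
        for i in range(lo, hi + 1)
    ]
    if not log_terms:         # empty tail: probability zero
        return math.inf
    m = max(log_terms)        # log-sum-exp for numerical stability
    log_p = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return max(0.0, -log_p)   # clamp tiny negative rounding error


def score_document(docs, index):
    """Return {term: (tfidf, neg_log_p)} for the document at `index`."""
    doc = docs[index]
    term_counts = Counter(doc)
    corpus_counts = Counter(tok for d in docs for tok in d)
    total_tokens = sum(len(d) for d in docs)
    num_docs = len(docs)
    scores = {}
    for term, k in term_counts.items():
        df = sum(1 for d in docs if term in d)   # document frequency
        tfidf = k * math.log(num_docs / df)      # assumed tf-idf variant
        nlp = neg_log_hypergeom_pvalue(
            k, len(doc), corpus_counts[term], total_tokens
        )
        scores[term] = (tfidf, nlp)
    return scores


# Hypothetical toy corpus, tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang and the bird flew".split(),
]
scores = score_document(docs, 2)
for term, (tfidf, nlp) in sorted(scores.items(), key=lambda kv: -kv[1][1]):
    print(f"{term:>5}  tf-idf={tfidf:.3f}  -log p={nlp:.3f}")
```

On this toy corpus both scores rank "bird" first and "the" last in the third document. That is only a sanity check of the idea, not evidence for the paper's empirical claim.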



TF-IDF


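TF-IDF (term frequency–inverse document frequency) is a weighting technique commonly used in information retrieval and text mining. It is a statistical measure of how important a word is to one document within a collection or corpus. A word's importance increases in proportion to the number of times it appears in the document, and decreases in inverse proportion to its frequency across the corpus. Variants of tf-idf weighting are often used by search engines to score or rank how relevant a document is to a user query. Besides tf-idf, web search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.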
